PREFACE


For this Ninth International Conference on Parallel Processing 
we received a total of 117 papers, 31 of which were from 6 countries in 
Europe, Canada, Israel, Japan, and the People's Republic of China.
Sixty-five papers were accepted for presentation at the meeting, 21 of 
which are to be presented in a one- and one-half hour poster session. 

In a poster session visual displays of all the papers are mounted on
bulletin boards, and the author of each paper is present during the
entire session for explanation and in-depth discussion with interested
persons. This session allowed us to accept more interesting papers than
would have been otherwise possible.


The conference featured a film festival covering the history of
and advances in computer architecture, and a panel session addressing
the outstanding issues of designing high performance computer systems.


We would like to thank Tse-yun Feng, the conference chairman, for 
arranging the location of this meeting, and printing and distributing the 
preliminary announcements. We are indebted to Mrs. Vivian Alsip for her 
valuable help in keeping all the correspondence to the authors and 
reviewers superbly organized. We also extend our thanks to Ms. Gerrie 
Katz of the IEEE Computer Society for her patience and help in producing 
this proceedings. Finally, we thank Tse-yun Feng and K.H. Kim for 
handling the papers by Banerjee, Gajski, and Kuck, and Lawrie and Vora. 

PROGRAM COMMITTEE 
David J. Kuck
Duncan H. Lawrie
Ahmed H. Sameh


SESSION 1: SOFTWARE AND LANGUAGES


A PARALLEL OPERATING SYSTEM FOR AN MIMD COMPUTER 


Rodney A. Schmidt 


Denelcor, Inc.
Denver, Colorado 80205


Summary 


The HEP computer system developed by 
Denelcor, Inc. under contract to the U.S. Army 
Ballistics Research Laboratory is an MIMD machine 
of the shared resource type as defined by Flynn
[1]. The architecture of this machine has been
covered earlier in a paper by Smith [2]. Briefly,
processes in HEP reside within tasks, which define
both a protection domain and an activation
state (dormant/active). Tasks reside within
processors, all of which access a shared data
memory. Multiple tasks may cooperate by sharing
a common region in data memory. Cells in data
memory have the property of being "full" or
"empty", and the execution of instructions in
processes may be synchronized by busy waiting (in
hardware) on the full/empty state of data memory
cells. Other than the state of data memory,
processes and tasks in different processors have
no means of synchronization or communication.
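The full/empty discipline described above can be illustrated in software. The following Python sketch is only an analogy: HEP performs the wait in hardware on the memory cell itself, whereas this sketch assumes a condition variable. The class name FullEmptyCell and its method names are hypothetical and do not come from the HEP system.

```python
import threading


class FullEmptyCell:
    """Software analogy of a data memory cell with a full/empty state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        # Wait until the cell is empty, store the value, then mark it full.
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        # Wait until the cell is full, take the value, then mark it empty.
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def run_demo(n=5):
    """One producer and one consumer synchronize only through the cell."""
    cell = FullEmptyCell()
    received = []

    def producer():
        for i in range(n):
            cell.write_when_empty(i)

    def consumer():
        for _ in range(n):
            received.append(cell.read_when_full())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because every write waits for "empty" and every read waits for "full", the two threads alternate strictly and no value is lost or read twice.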


High-level language (e.g. FORTRAN) programs
in this machine are explicitly parallel.
Subprograms are made to run in parallel with the
main program by an explicit CREATE statement
analogous to CALL in ordinary FORTRAN. Code
within a subprogram is SISD. The objective of
the HEP operating system is to preserve the
parallelism of the user program by executing in
parallel during the performance of I/O and
related supervisory functions. The operating
system must:


1.) Allow all user processes to execute
during I/O related supervisory
computation;


2.) Allow multiple concurrent supervisory
I/O computations;


3.) Allow reentrant use of code in the 
supervisor and the user program; 


4.) Provide maximum user performance by 
consuming minimum resource in both time 
and space. 


In SISD computers, reentrancy is usually obtained 
with some form of dynamic memory allocation. 
Concurrency of the operating system and the user 
is not possible due to the SISD nature of the 
machine.


In HEP, most dynamic memory allocation would
generate considerable serialization of code
around the resource lock required to safeguard
the memory allocation data structure. In
addition, HEP cannot allow any memory used by the
system to be writeable by the user since the
user is running truly in parallel with the
system and could destroy any location at any time.


In the HEP operating system, the available
general purpose registers (about 2,000 of them)
are divided a priori into groups of uniform
length. When a process is created, the creating
process must obtain a register environment from
a table of available groups. This operation is
relatively infrequent and inexpensive. All
register environments are identical, and no state
is retained in them.


Main memory (data memory) environments are 
obtained at the subprogram level by each sub- 
program as it is invoked. Space is obtained from 
a pool of data memory environments peculiar to 
that subprogram. The user must specify at link 
time how many such environments should be 
allocated for each subprogram. Control of an 
environment is obtained via a table of free 
environments, but the table is local to the sub- 
program. Thus, serialization for access to an 
environment is only between multiple, nearly 
simultaneous, invocations of the same subprogram, 
and is much less damaging to performance. 
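The per-subprogram pool described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class EnvironmentPool, the LINK_SPEC table and the subprogram names are hypothetical, and the sketch raises an error when a pool is exhausted, which is a simplification and not a claim about what HEP does in that case.

```python
import threading


class EnvironmentPool:
    """Fixed pool of data memory environments owned by one subprogram."""

    def __init__(self, count):
        self._lock = threading.Lock()    # local to this subprogram only
        self._free = list(range(count))  # table of free environments

    def acquire(self):
        # Only invocations of the same subprogram contend for this lock.
        with self._lock:
            if not self._free:
                raise RuntimeError("no free environment for this subprogram")
            return self._free.pop()

    def release(self, env):
        with self._lock:
            self._free.append(env)


# Hypothetical link-time specification: subprogram name -> environment count.
LINK_SPEC = {"SOLVE": 2, "PRINT": 1}
POOLS = {name: EnvironmentPool(count) for name, count in LINK_SPEC.items()}
```

Since each subprogram has its own table and its own lock, an invocation of one subprogram never waits behind an invocation of another.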


Data memory environments are a resource not 
visible to the user, and as such can contribute 
to deadlock problems. Given the user's ability 
to increase the amount of data memory resource 
allocated to a subprogram, the deadlock problem 
can be circumvented without much difficulty. 


Concurrent I/O presents its own set of
problems. In FORTRAN, a single I/O is implemented
with multiple calls to I/O formatting services.
State must be retained by the formatter during
this process. This state is bound to the I/O
unit, not the subprogram. Further, the amount
of space required is not known until run time.
Thus, some type of run time memory management is
required, and the resource thus allocated is
invisible to the user. The space must be
allocated in an area accessible to all processors
in a multi-processor job, so that all tasks may
share the same I/O units.


The strategy employed in HEP is to allocate
I/O buffers for a logical unit upon the first
I/O to the unit. The space is then consumed for
the duration of the program, even if the I/O unit
is closed. If the I/O unit is re-opened for
another file, the record length of the new file
must be less than or equal to that of the old
file. In this implementation, space can be
allocated from a top-of-memory pointer which
moves in only one direction. Serialization of
processes occurs only on simultaneous first I/O
operations, and only for the few microseconds
required to move the pointer. This contrasts
with the substantial serialization introduced
by the normal scheme of a linked list of
available space with garbage collection.
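The one-directional pointer scheme can be sketched in a few lines. The Python below is an illustration only: the class name TopOfMemoryAllocator, the choice of a downward-moving pointer, and the use of a dictionary keyed by logical unit are all assumptions made for the example, not details taken from the HEP implementation.

```python
import threading


class TopOfMemoryAllocator:
    """Allocate one I/O buffer per logical unit from a one-way pointer."""

    def __init__(self, top):
        self._lock = threading.Lock()
        self._top = top        # assumed to move downward only, never back
        self._buffers = {}     # logical unit -> (address, record_length)

    def buffer_for(self, unit, record_length):
        entry = self._buffers.get(unit)
        if entry is None:
            # Only a first I/O on a unit takes the lock, and only long
            # enough to move the pointer and record the buffer.
            with self._lock:
                entry = self._buffers.get(unit)
                if entry is None:
                    self._top -= record_length
                    entry = (self._top, record_length)
                    self._buffers[unit] = entry
        address, allocated = entry
        if record_length > allocated:
            # A re-opened unit may not need a longer record than before.
            raise ValueError("record length exceeds the original buffer")
        return address
```

Space is never returned, so no free list or garbage collection is needed; the cost is that a unit keeps its buffer for the life of the program.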


Consideration is being given to allowing a 
user to supply his own logical record buffer, 
with only the fixed portion of the I/O state held
at the top of memory. This would allow the user 
greater dynamism in the logical record size, at 
the expense of managing his own resources. 


HEP supervisors require two types of
dynamic memory: registers to use while copying
logical records to/from physical records, and
data memory to hold file parameters for open
files. Of these, the register allocation is the
simplest. Since the user's register requirement
can be determined from the number of processes
requested (a control card parameter), all
remaining registers in the register memory
partition can be used for supervisor I/O operations.
These registers are allocated from a bit table
to active I/O operations.
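A bit table of this kind can be sketched as follows. The Python below is a hedged illustration: the class RegisterBitTable, the helper supervisor_table, the group size, and the assumption that user registers occupy the low end of the partition are all hypothetical choices made for the example.

```python
class RegisterBitTable:
    """Bit i set means register group i is in use by an I/O operation."""

    def __init__(self, first_register, group_count, group_size):
        self.first = first_register
        self.count = group_count
        self.size = group_size
        self.bits = 0

    def allocate(self):
        # Return the base register of the first free group, or None.
        for i in range(self.count):
            if not (self.bits >> i) & 1:
                self.bits |= 1 << i
                return self.first + i * self.size
        return None

    def free(self, base):
        i = (base - self.first) // self.size
        self.bits &= ~(1 << i)


def supervisor_table(total_registers, processes, per_process, group_size):
    # The user's requirement follows from the requested process count;
    # whatever remains in the partition is left for supervisor I/O.
    used = processes * per_process
    groups = (total_registers - used) // group_size
    return RegisterBitTable(used, groups, group_size)
```

For instance, with 2,000 registers, 50 user processes of 32 registers each, and groups of 16, the supervisor would be left with 25 groups starting at register 1600.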


Data memory allocation is more difficult.
It is not known until run time how many files
will be used, or how much logical record buffer
space will be required by the user. Fortunately,
the amount of supervisor space required per open
file is constant. The operating system merely
allocates supervisor space for enough files to
accommodate the larger system programs
(compiler, etc.) and leaves the remaining space
for the user. The default limit on open files
may be overridden with a control card for users
with special requirements.


The present HEP system provides a
high-performance low overhead environment for parallel
computational activities. Our next activity will
be to extend this capability with
high-performance parallel I/O operation with speed
comparable to our processing speeds. The
parallel file system will include such features
as record interlock within files and concurrent
read/write capability from multiple jobs to the
same file.


THE PROGRAMMING LANGUAGE
PARALLEL PASCAL


by 


Anthony P. Reeves, John D. Bruner, and Mark S. Poret 


School of Electrical Engineering 
Purdue University 
West Lafayette, Indiana 47907 


Summary


An extended version of the Pascal programming
language for parallel processors is described.
This language reduces the semantic gap between the
very popular sequential Pascal language and a
large group of highly structured parallel
processors. Only a small number of carefully chosen
features have been added to the conventional
Pascal language. A specification of the language is
given in [1].

Most parallel processors are currently
programmed in either assembly language or a
machine-dependent special version of Fortran. In some
cases, an attempt has been made to implement a
sequential high level language on a parallel
processor. This may work well on a tightly-coupled
processor with a small number of processing
elements (PE's). The advantage is that existing
programs may be used without change and that
programmers do not have to learn anything new.
Unfortunately, sequential languages are often
unsuitable for the expression of array manipulations and
efficiency is lost. By contrast, since Parallel
Pascal has been designed for SIMD processors, it
is a high level language offering efficiency,
portability, and error detection and diagnosis
facilities.

Parallel Pascal primitive operators are based
on the instructions available on Parallel Matrix
Processors (PMP's), a class of highly structured
parallel processors involving a large number of
PE's with a limited PE interconnection scheme.
Two examples of PMP's are the MPP [2] and BASE
[3]. Parallel Pascal should be efficiently
implementable on a very wide range of architectures,
including vector and pipeline processors. However,
the cost for this portability is that many powerful
features of particular parallel processors may not be
made easily available as operators. As a result, it
may be necessary to perform some simple
reformulation of algorithms to achieve optimum efficiency
when transporting programs.

Parallel Pascal is not simply implementable on 
an MIMD processor. However, a program written in 
Parallel Pascal can be divided more easily into 
subtasks than an equivalent conventional Pascal 
program.


The prime objective of developing a parallel
processor is to achieve high speed execution;
therefore an efficient programming of any problem
is essential. The extensive error checking
available with conventional Pascal impairs the
efficiency of the program execution unless the
parallel processor contains error checking hardware
(which is usually not the case). An
implementation of Parallel Pascal should provide for the
generation of code without runtime error checking
once programs have been debugged. In addition, an
implementation should provide for the inclusion of
assembly language segments for critical sections,
most likely as externally called procedures.

The work reported here was funded by NASA-Goddard

A translator program has been written for a 
subset of Parallel Pascal which will translate a 
Parallel Pascal program into a conventional Pascal 
program; the translator itself is written in Pas- 
cal. The design of Parallel Pascal is being re- 
fined as experience is gained with writing practi- 
cal Parallel Pascal programs and running them via 
the translator. A Parallel Pascal compiler for 
the MPP is being developed. 

The two principal goals of the Pascal
programming language were to make available a language
for teaching systematic, structured programming
and to develop reliable, efficient implementations
on presently available computers [4]. The
resulting language is based on Algol 60 and has a richer
set of program control structures and data
structures (types).

The goal of implementability was achieved by
considering how to simply compile the language
when it was designed. The structure of the
language was chosen so that a simple parsing
algorithm could be used [5]. Unfortunately, the goal
of simplicity has led to a few deficiencies which
should be remedied in future language revisions.
One serious deficiency for Parallel Pascal is the
lack of dynamic arrays: array dimensions may
only be specified by constants. This provides
simplicity and strict typing, but makes it very
difficult to write a library of functions for
general array operations.

A special version of Pascal with operating
system features, called Concurrent Pascal [6], has
already been developed by Per Brinch Hansen. In a
sense, Concurrent Pascal reduces the semantic gap
between a user Pascal program and the total
computer environment including the supervisor mode
and operating system. In Parallel Pascal an
attempt is made to reduce the semantic gap between
the Pascal language and parallel processor
architectures.

In Parallel Pascal a set of standard functions
for general array manipulations will be
introduced. All standard functions will be defined for
any size arrays; this is consistent with the
Pascal concept of standard functions operating on
more than one data type. User defined procedures
and functions will be limited to a single array
size.

Parallel Pascal is characterized by the follow- 
ing extensions to Pascal: 


(a) Arrays to be manipulated by the parallel pro- 
cessor may be explicitly declared as such by 
the word parallel, e.g. 


a,b,c: parallel array [1..8,1..8] of type 


(b) Expressions may involve entire arrays; also, 
functions may return entire arrays, e.g. 


a := b + sin(c) + 3
means
a[i,j] := b[i,j] + sin(c[i,j]) + 3 ∀ i,j


(c) All control statements may have arrays for 
control variables, e.g. 


if A>B then C := 3
means
if A[i,j]>B[i,j] then C[i,j] := 3 ∀ i,j


(d) A new set of standard functions is available
for entire array manipulation. These
functions are defined for all array sizes and
types.


shift(array, S1, S2, ..., Sn)
rotate(array, S1, S2, ..., Sn)


The shift function moves the data in the
amounts specified by the integers S1 ... Sn
(one S for each dimension of the array).
Null values are inserted at the edges of the
array. The rotate function is similar to
shift except that the data shifted in at one
edge of the array is the data shifted out of
the opposite edge of the array.


expand(array, dimension, size) 


The expand function replicates the array 
along a new dimension size times. 


transpose(array, D1, D2)


This transposes an array about the two given
dimensions D1 and D2. If only one dimension
is specified then the data is "flipped" about
that dimension.


There are also several functions which 
apply a reduction operator over all of the 
specified dimensions. 


general format: fn(array,D1,D2,...,Dn)


asum    arithmetic summation
aprod   product
aand    logical and
aor     logical or
amax    maximum value
amin    minimum value


For example, the sum of all elements in a ma- 
trix M is specified by asum(M,1,2) and a vec- 
tor containing the maximum values of each row 
of M is specified by amax(M,2). 


(e) For convenient input and output of parallel 
array data the procedures read and write have 
been extended so that a whole array may be 
read or written. The capability of reading
or writing a subarray of a large array file 
may be added later. 


(f) The index for a Parallel Pascal array may be 
scalar, elided, a logical vector or a set. A 
scalar index selects one item in a dimension 
and reduces the rank of the result by one. 
An elided index specifies all items in that 
dimension. A subset of items in a dimension
may be specified by either a set or a logical 
vector. The logical vector must be the same 
length as the dimension it indexes. 
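The whole-array semantics listed in (b) through (f) can be emulated on ordinary nested lists. The Python sketch below is not Parallel Pascal and makes several assumptions that the text does not settle: it handles two-dimensional arrays only, a positive shift amount moves data toward higher indices, the null value is taken to be 0, and all function names other than the operations they imitate are hypothetical.

```python
import math


def elementwise(b, c):
    # a := b + sin(c) + 3, applied to every [i,j]
    return [[bij + math.sin(cij) + 3 for bij, cij in zip(brow, crow)]
            for brow, crow in zip(b, c)]


def masked_assign(a, b, c, value):
    # if A > B then C := value, applied to every [i,j]
    return [[value if aij > bij else cij
             for aij, bij, cij in zip(arow, brow, crow)]
            for arow, brow, crow in zip(a, b, c)]


def rotate(m, s1, s2):
    # Data leaving one edge re-enters at the opposite edge.
    rows, cols = len(m), len(m[0])
    return [[m[(i - s1) % rows][(j - s2) % cols] for j in range(cols)]
            for i in range(rows)]


def shift(m, s1, s2, null=0):
    # Data leaving an edge is lost; null values enter at the other edge.
    rows, cols = len(m), len(m[0])

    def source(i, j):
        si, sj = i - s1, j - s2
        return m[si][sj] if 0 <= si < rows and 0 <= sj < cols else null

    return [[source(i, j) for j in range(cols)] for i in range(rows)]


def asum_all(m):
    # asum(M,1,2): reduce over both dimensions
    return sum(sum(row) for row in m)


def amax_rows(m):
    # amax(M,2): reduce over dimension 2, giving one maximum per row
    return [max(row) for row in m]


def select_rows(m, mask):
    # Logical-vector index on dimension 1, elided index on dimension 2
    return [row for row, keep in zip(m, mask) if keep]
```

On a parallel matrix processor each of these would be a single array operation; the nested loops here only spell out which elements are combined.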


Parallel Pascal also has a bit indexing mechan- 
ism for the low level programming of bit-serial 
parallel processors. This mechanism is outside 
the normal usage of the language; however, its 
availability may make it possible to avoid using 
assembly code for low level bit serial operations. 
This feature is, in general, not portable between 
different implementations as the bit representa- 
tion of numbers is machine dependent. 


References 


1. Reeves, A. P., Bruner, J., and Poret, M.,
"The Programming Language Parallel Pascal,"
Internal Purdue Electrical Engineering
report, 1980.


2. Batcher, K., "MPP -- A Massively Parallel
Processor," Proceedings of the 1979
International Conference on Parallel
Processing, 1979, p. 249.


3. Reeves, A. P., "A Systematically Designed
Binary Array Processor," IEEE Transactions on
Computers, April 1980.


4. Jensen, K. and Wirth, N., "Pascal User's
Manual and Report," Springer Verlag, New
York, 1974, p. 133.


5. Wirth, N., "The Design of a Pascal Compiler,"
Software-Practice and Experience, Vol. 1,
1971, p. 320.


6. Brinch Hansen, P., "The Programming Language
Concurrent Pascal," IEEE Transactions on
Software Engineering, Vol. SE-1, No. 2, June
1975, pp. 199-207.


Decomposing a Program for Multiple Processor Systems 


Massachusetts Institute of Technology 
Laboratory for Computer Science 
545 Technology Square 
Cambridge, Mass. 02139 


Abstract 


The success of high performance multiple processor systems depends upon our ability to decompose a
program into small segments suitable for execution on one processor. It is argued in this paper that purely
applicative languages are better suited for parallel processing because they offer considerable advantage over
Fortran-like languages in program transformation and decomposition. A scheme for decomposing applicative
programs is described through examples.


1 Introduction: 


The operation of a multiple processor system designed to 
increase the execution speed of a single program can be viewed at 
two levels. At the macroscopic level the system carries out the 
computations of the user’s high-level program. At the microscopic 
level each processor executes its own set of instructions and
exchanges data with other processors as needed. Implementing a 
program on such a system requires transforming the high level 
description into a set of programs for the individual processors. 


Work at the University of California, Irvine has shown how
high-level dataflow programs can be mapped onto a set of
asynchronously cooperating processors as the computation unfolds
dynamically [4,10]. For applications such as partial differential
equation simulation, however, the cost and overhead of fully
general, dynamic mapping may be unwarranted. These
applications are characterized by extremely high computational
requirements and simple and regular program and data structures
[3]. Hence a static mapping of activities onto processors may prove
more efficient and cost effective without creating an undue loss of
flexibility. Furthermore, a static mapping scheme for these
problems could distribute activities and data structure elements
over the processors in such a way that information flow is highly
localized. This would allow a simpler, lower cost interconnection
network than is required to achieve high performance with
dynamic mapping. The effectiveness of static mapping of activities
is directly related to the decomposability of a program.


In this paper we first discuss the appropriateness of purely
applicative languages for parallel processing and then give a
scheme for decomposing applicative programs for multiple
processor systems. It is assumed that each processor is capable of
storing its own program and data and can communicate with any
other processor in the system. The internal organization of a
processor does not affect our scheme.


2 Impact of High level language on
Decomposability:


Traditionally, Fortran and its extensions have been regarded as 
the only acceptable high-level languages for high performance 
systems such as CRAY-1, STAR-100 and Illiac IV. The main
reason for programming in Fortran is to maintain compatibility 
with a large existing body of scientific software. This compatibility 
is of little use in practice because existing Fortran programs do not 
show significant performance gains on new machines with different 
architectures. In fact, a recoding of parts of the program either in 
machine language or in some new extension of Fortran is required 
to achieve high performance. Software tools such as vectorizing 
and optimizing compilers have been successful on a very limited 
class of Fortran programs, namely those programs that do not have 
undesirable "side-effects" (see [1] for an in-depth discussion of
side-effects). In a maximally parallel program, statement 
executions are ordered only by data dependencies. It is difficult to 
detect data dependencies in a Fortran (in fact in any imperative 
language) program due to accesses to global variables and 
operations on data structures. A programmer often tries to 
minimize storage by reusing the same array over and over again. 
This further complicates detection of minimal data dependencies. 
A ban on the use of global variables or arrays seems absurd in view 
of the fact that for efficiency, a clever Fortran programmer often 
passes parameters to subroutines through common declaration. 
Potential side-effects of common declarations in Fortran are so 
intricate that most optimizing compilers will not optimize across 
subroutines. 


Kuck and his associates [12, 13, 14] have studied and classified a
large number of Fortran programs in an attempt to identify
features that should be supported by high performance systems.
They have given various statement execution orderings that
potentially could be exploited by a compiler for a multiple
processor machine or an array or pipeline machine. Various
transformations of the source program are also suggested to
enhance parallelism in a program. Even though the usefulness of
studying existing algorithms¹ for designing high performance
architectures is undeniable, we take issue with Kuck's acceptance of
the adequacy of Fortran or its extensions. An efficient multiple
processor architecture cannot be developed unless it supports the
execution model of a parallel processing language systematically.


Our approach to program decomposition may be applicable to a
class of Fortran programs structured along certain guidelines.
However the basis of our programming is so radically different
from Fortran that syntactic compatibility with Fortran is of little
use. If a Fortran program has to be rewritten for a high
performance architecture then it can just as well be written in a new
language. We hope that eventually parallel processing languages
will remove the constraints placed by Fortran-like languages on our
thinking, encouraging us to develop yet faster algorithms.


3 Applicative Programming for Parallel
Processing:


A parallel or asynchronous programming language for a
multiple processor system should not incorporate the concept of an
updatable storage cell [4, 8]. This is essential to avoid complex
synchronization mechanisms and elaborate sequencing of
operations. When all computation is based on values, as opposed
to addresses where values are kept, a race to read
or write is not possible. The two most widely known languages that
can support pure applicative (i.e., functional) programming are
LISP and APL. However, both have such different syntax from
conventional languages that the effort involved in learning either of
them is quite substantial. The difficulty is further compounded by
the fact that both LISP and APL also present an entirely different
programming paradigm. One has to almost unlearn Fortran
programming to be able to think clearly in either of these
languages. LISP, due to its recursive nature and strange syntax, is
treated by scientific programmers as an amusing diversion for
academicians. The inefficiency of these languages on conventional
architectures also lends support to Fortran adherents.


We think that the "syntax" problem of applicative languages is
completely solvable. Two languages, Id [4] and VAL [2], currently
under development at the University of California at Irvine and
MIT provide a syntax as well as a programming paradigm that is
superficially quite similar to the Algol-Pascal family of languages.
Both of these languages are purely applicative, and we believe that
a programmer familiar with Algol can learn Id in a few days.


Generally, an applicative language such as LISP allows the
creation and use of data structures in a much more dynamic
manner than Fortran. Hence a fair comparison of their efficiency is
difficult. However, for most numerical algorithms this expressive
power of applicative languages is not required. An applicative
language with as restrictive a control and data structure as Fortran
may still be less efficient than Fortran on a sequential computer.
However, for a multiple processor machine the efficiency of a high
level language will depend on the availability of program
decomposition schemes, and due to this fact applicative languages
may indeed turn out to be more efficient than imperative languages
for such machines. A consensus seems to be emerging on this point
[6, 9, 11].


¹ We prefer to study algorithms over programs because algorithms are more
language independent.


The problem of decomposition can also be viewed as an exercise
in program transformation. A fair amount of work has already
been done on transforming applicative programs (see [7] for
example). We illustrate the flexibility for decomposition provided
by an applicative program through an example. Consider a
classical relaxation algorithm in one-dimension. One computes the
new values of the x elements repeatedly using the following
equation.


$$\mathrm{new}\,x_i = (x_{i-1} + x_i + x_{i+1})/3, \qquad 1 \le i \le n$$
where $x_0$ and $x_{n+1}$ remain constant.


A straightforward Fortran program would do this in the
following way.


C     X IS AN ARRAY OF N+2 ELEMENTS
C     X(1) AND X(N+2) REMAIN CONSTANT
      N1=N+1
      DO 20 K=1, KMAX
      DO 10 I=2, N1
      Y(I) = (X(I-1) + X(I) + X(I+1))/3.
   10 CONTINUE
      DO 15 I=2, N1
      X(I) = Y(I)
   15 CONTINUE
   20 CONTINUE                                       (1)


A compiler can easily generate good code for a multiple processor 
machine from the above program. Even if a programmer is clever, 
and avoids copying array Y into X by switching back and forth 
between X and Y, a vectorizing compiler will be able to deal with it 
effectively. However, if array X is large, and a programmer decides
to avoid using another array Y altogether, the following program 
may result. 


      N1=N+1
      DO 20 K=1, KMAX
      T1=X(1)
      T2=X(2)
      DO 10 I=2, N1
      X(I)=(T1 + T2 + X(I+1))/3.
      T1=T2
      T2=X(I+1)
   10 CONTINUE
   20 CONTINUE                                       (2)


It would be extremely difficult for a compiler to detect a
transformation in which all the elements of array X are relaxed 
simultaneously. 
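That the two Fortran programs compute the same values, even though the second hides the element-wise independence behind the temporaries T1 and T2, can be checked with a direct transcription. The Python below is a sketch using 0-based lists of length n+2; the function names are hypothetical.

```python
def relax_two_arrays(x, kmax):
    """Transcription of program (1): compute into y, then copy back."""
    x = list(x)
    n = len(x) - 2
    for _ in range(kmax):
        y = x[:]  # boundary elements are carried over unchanged
        for i in range(1, n + 1):
            y[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0
        x = y
    return x


def relax_in_place(x, kmax):
    """Transcription of program (2): one array plus two temporaries."""
    x = list(x)
    n = len(x) - 2
    for _ in range(kmax):
        t1, t2 = x[0], x[1]  # old values of x[i-1] and x[i]
        for i in range(1, n + 1):
            x[i] = (t1 + t2 + x[i + 1]) / 3.0
            t1 = t2
            t2 = x[i + 1]
    return x
```

In the second version every iteration of the inner loop reads temporaries written by the previous one, so the loop looks sequential to a dependence analysis although each new element depends only on old values.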


On parallel computers, programmers use the trick of relaxing
only half the elements (i.e., odd or even) in one iteration to avoid
excessive use of storage. It should be noted that the algorithm for
relaxing odd and even elements alternately is an entirely different
one, and requires mathematical sophistication on the part of a
programmer to prove its stability.


Now we contrast this situation with an applicative program 
written in Id. 


(for k from 1 to kmax do
   new x ← (initial y ← <0:lb, n+1:rb>
            ! lb and rb represent the boundary values at
              selectors 0 and n+1 respectively !
            for i from 1 to n do
               new y[i] ← (x[i-1] + x[i] + x[i+1])/3.
            return y)
 return x)                                           (3)


We rely on the reader's intuition to understand the control structure of
the above Id program. Manipulation of arrays (i.e. an example of
structures) in applicative languages needs some explanation. One
thinks of every array construction operation (such as append) as
producing a new array. Hence append(a,i,v) produces a new array
a' which differs from a only in position i. Even though new y[i] ← ...
looks like a conventional assignment statement, y[i] does not refer
to a storage cell. Rather one should think of the whole array as a
value, and y[i] as referring to a part of the value. Naturally if one
changes a part of a value the aggregate value changes too. In this
example since i is taken from an unrepeated set of values (i.e., 1 to n)
it is possible to regard y as an I-structure [5]. In contrast to ordinary
structures, an element of an I-structure can be used as soon as it is
created. Thus I-structures allow greater freedom in manipulating
programs for efficient execution on a parallel computer.


Using a vectorizing compiler it is as easy to generate code for a
multiple processor machine from this Id program as it was with the
first Fortran program. However, the same Id program allows us to
generate code that may overlap several iterations of the outer loop.
Note that since y is an I-structure, the (k+1)-th iteration of the outer
loop can begin as soon as the first three elements of x from the k-th
iteration have been computed. If we desire we can easily derive
implementations of this Id program that will use the same
minimum amount of storage as the second Fortran program and
still allow concurrent execution of several iterations (see Figure 1).
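The overlap of outer iterations can be made concrete with a small timing model. The Python sketch below is an illustration under assumptions that are not in the text: each outer iteration is handled by its own processor, that processor produces one element per time step in index order, and an element becomes usable by the next iteration as soon as it is produced. The function name is hypothetical.

```python
def pipelined_finish_time(n, kmax):
    """Step at which the last element of the last iteration is produced."""
    # prev[i] is the step at which element i of the previous iteration is
    # available; the initial x and both boundary values are ready at step 0.
    prev = [0] * (n + 2)
    for _ in range(kmax):
        cur = [0] * (n + 2)
        for i in range(1, n + 1):
            # Element i needs x[i-1], x[i], x[i+1] from the previous iteration.
            ready = max(prev[i - 1], prev[i], prev[i + 1])
            # One element per step per iteration, produced in index order.
            cur[i] = 1 + max(ready, cur[i - 1])
        prev = cur
    return prev[n]
```

Under this model, with n = 8 and kmax = 3, the last element is ready at step 12, whereas running the three iterations one after another would take 24 steps: each later iteration trails its predecessor by only two steps.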


Our premise is that a high level language should permit coding of
algorithms to show the maximal parallelism inherent in an
algorithm. Such languages have to be purely functional in nature.
The task of decomposing and transforming maximally parallel
programs for a parallel machine is considerably simpler than the
task of decomposing Fortran programs. In the rest of this paper we
will outline a scheme for decomposing Id programs for multiple
processor machines. The scheme will be described through
examples.


4 Decomposition Scheme: 


Applicative programs that have loops as their primary control 
structure and that operate on bounded-size data structures can be 
decomposed into programs for a set of individual processors in 
three steps: 


1. The nested loop structures are unrolled into a network
of computation cells.


2. Data structure elements are assigned to the cells. 


3. The network of computation cells is mapped onto the 
actual processors of the system, according to the size 
and structure of both the network and the computer 
system. 


A computation cell can be regarded as a virtual processor to
which a program and local data has been assigned. However, the
virtual processor program may also refer to data that is not local, in
which case a communication between this virtual processor and the
virtual processor holding the data takes place. We will use
programs written in Id language to illustrate the decomposition
scheme. All expressions in Id have the property that for every set
of inputs received they must produce exactly one set of outputs.
Due to this property, the communication between computation cells
is highly structured and its pattern can be determined a priori. In
order to remain consistent with the data-driven nature of Id, we
assume, without loss of generality, that a non-local value is sent to,
rather than demanded by, a computation cell. We can draw a
directed link from the cell that sends a value to the cell that receives
it, and thus a network of virtual processors can be created. If an
unbounded number of processors were available and if these
processors could be interconnected in any desired pattern, then an
ideal network topology for the physical system would be the
topology of the computation cell network.


4.1 Defining Cells of Computation: 


A programmer defines a cell by specifying what task is to be 
carried out by it. For example, a task may be defined as the work 
done in the i-th iteration of a loop; hence, by unfolding a loop, a 
number of computation cells may be defined. A program for the 
task carried out in the i-th iteration of an Id loop can be generated 
automatically. In general, more than one computation cell 
definition is possible, as we show below. Consider the following 
program for the conventional matrix multiply algorithm. 


procedure matrix_multiply (a, b, l, m, n) 
    ! multiply matrix a of dimensions l X m by matrix b 
      of dimensions m X n ! 
    (initial c ← <>    ! <> represents an empty structure ! 
     for i from 1 to l do 
         new c[i] ← (initial d ← <> 
                     for j from 1 to n do 
                         new d[j] ← 
                             (initial s ← 0 
                              for k from 1 to m do 
                                  new s ← s + a[i,k]*b[k,j] 
                              return s) 
                     return d) 
     return c)                                              (4) 


In Id a matrix is represented by a one-dimensional array of 
one-dimensional arrays. Hence output matrix c is constructed by 
appending together l rows, each represented by an array d. We note 
that this Id program, when executed under the U-interpreter [5], can 
automatically carry out all the l X m X n multiplications in 
parallel. This effect is achieved without any global analysis of the 
program. As stated earlier, the attempt in this paper is to perform 
certain functions of the U-interpreter at compile time and hence 
map concurrent activities statically onto processors. 
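
As a rough illustration of why the l X m X n products are independent, the following Python sketch transcribes the structure of program (4). It is not Id and not the U-interpreter; the function name `element` and the 0-based indexing are assumptions made only for this example. Each element of c depends only on one row of a and one column of b, so no call to `element` needs the result of any other.

```python
def matrix_multiply(a, b, l, m, n):
    """Functional-style transcription of program (4); indices are 0-based here."""
    def element(i, j):
        # Innermost loop: s depends only on row i of a and column j of b.
        s = 0
        for k in range(m):
            s = s + a[i][k] * b[k][j]
        return s

    # c is a one-dimensional array of rows d, each itself a one-dimensional array.
    return [[element(i, j) for j in range(n)] for i in range(l)]


a = [[1, 2, 3], [4, 5, 6]]      # 2 x 3
b = [[1, 0], [0, 1], [1, 1]]    # 3 x 2
c = matrix_multiply(a, b, 2, 3, 2)
print(c)
```

Because `element` has no side effects, the calls could be evaluated in any order or concurrently, which is the property the decomposition scheme relies on.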


Suppose we specify computation cells to carry out one iteration 
of the loop with index variable i. The network of Figure 2 will be 
produced by unfolding loop i. The program for the i-th cell can be 
written as follows. 


append(c_{i-1}, i, d) where d is computed as follows 
d ← (initial d ← <> 
     for j from 1 to n do 
         new d[j] ← (initial s ← 0 
                     for k from 1 to m do 
                         new s ← s + a[i,k] * b[k,j] 
                     return s) 
     return d) 


The subscript on a variable (e.g., c_{i-1}) indicates the cell where that 
variable will be computed. This program is valid for cell numbers 1 
to l. The computation cell 0 should produce an empty array <>, 
and the result should be available in cell l+1. 
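
The cell chain just described can be imitated in Python as a minimal sketch. The names `cell_program` and `run_network` are hypothetical, and dictionaries keyed by 1-based selectors stand in for Id structures. Cell 0 supplies the empty structure, each cell i appends its row to what cell i-1 produced, and the final structure is what cell l+1 would receive.

```python
def cell_program(i, c_prev, a, b, m, n):
    """Program for cell i (1 <= i <= l): compute row d, then append it to c_{i-1}."""
    d = {}
    for j in range(1, n + 1):
        s = 0
        for k in range(1, m + 1):
            s = s + a[i - 1][k - 1] * b[k - 1][j - 1]
        d[j] = s
    c_i = dict(c_prev)      # append yields a new structure; c_{i-1} is not mutated
    c_i[i] = d
    return c_i


def run_network(a, b, l, m, n):
    c = {}                              # cell 0 produces the empty structure <>
    for i in range(1, l + 1):           # cells 1..l
        c = cell_program(i, c, a, b, m, n)
    return c                            # the value available in cell l+1


result = run_network([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2, 2)
print(result)
```

Only the append steps form a chain in this sketch; the computation of each row d does not depend on any other cell, so the rows could be computed concurrently.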


The definition of computation cells can be critical to exploiting 
parallelism in a program. In the matrix multiply procedure, if l is 
much smaller than the actual number of processors available, then 
unfolding either both loop i and loop j, or only loop j, may be more 
advantageous than unfolding only loop i. The network produced as 
a result of unfolding loop i and loop j of program (4) is shown in 
Figure 3. 


There is no obvious advantage in unfolding the outer loop of 
program (3) for the relaxation algorithm. Computation cells 
produced in this manner will execute essentially in a sequential 
order. However, if loop i is unfolded, concurrent relaxation of all 
the elements of x is possible. The process of unfolding an inner 
loop without unfolding the outer loop in a nested loop structure is 
somewhat tricky. The result of unfolding loop i of program (3) is 
shown in Figure 4. The cell programs are given below. 


for cell 0 
    y_0 ← <0:lb, n+1:rb> 
for cell i, 1 ≤ i ≤ n 
    y_i ← append(y_{i-1}, i, t) where t is computed as follows 
    t ← (x_{n+1}[i-1] + x_{n+1}[i] + x_{n+1}[i+1])/3. 


Note that, as before, x_{n+1} means that x is defined in cell n+1. 
The program for cell n+1 is 


(for k from 1 to Kmax do 
     new x ← y_n(x) 
 return n) 


The meaning of y_n(x) is that output from cell n is needed but it can 
only be obtained after x is supplied. In a dataflow interpretation, 
the value array x is sent by cell n+1 to all the relevant cells as soon 
as x is produced. For every x value that cell n+1 outputs, it 
receives an input value y_n, which becomes the new value of x. The 
initial value of x has to be given to cell n+1 to start the 
computation. Such an input is implicit in program (3). 


A reader at this point may consider the computation cell 
network of Figure 4 to be very wasteful because it makes many 
copies of array x. Indeed this is what we wish to avoid by mapping 
structures on computation cells. 


4.2 Mapping Data Structures on Computation Cells: 


Once computation cells have been defined, a mapping of data 
structures (i.e., matrices and arrays) onto these cells can be 
specified. For example, a programmer may specify that x[i,j] for all 
j should be mapped onto cell k. Mappings need not be one to one. 
Consider again program (3) with the inner loop unfolded. A 
mapping that seems quite sensible is that elements x[i-1], x[i] and 
x[i+1] be mapped onto cell i. This mapping assigns each element 
of x to three computation cells. Treating a data structure as a 
collection of elements, each one of which is assigned to a cell or 
cells, eliminates the select operation on data structures. Suitable 
mappings to reduce communication due to the select operation are 
straightforward to derive but unfortunately solve only half the 
problem. 


Consider a mapping in which each x[i] is mapped onto only cell 
i. The program for cell i can be expressed as follows: 


y_i ← append(y_{i-1}, i, t) where t is computed as follows 
t ← (x[i-1]_{i-1} + x[i] + x[i+1]_{i+1})/3. 


where each x[j] should be treated as one value and the meaning of 
x[j]_k is, as usual, that x[j] is defined in cell k. 


The value t computed by cell i becomes part of array y_i and is 
passed on to cell i+1, which passes it to cell i+2, and so on. It 
finally reaches cell n+1 as a part of the value y_n. The new value of 
x is y_n, and it is y_n that is distributed to cells 1 to n. Hence the x[i] 
that cell i receives is in fact the last t computed by cell i. This makes 
the whole process of constructing x and then distributing it seem 
unnecessary. Every cell should compute t and store it for the next 
calculation of t. It must still communicate the value of t to cells i-1 
and i+1 in order for these cells to compute their t values, but most 
of the communication from cell n+1 to cell i will be avoided. 
There has to be some communication from cell n+1 to cell i to 
indicate whether the computation has terminated or not (i.e., k > Kmax?). 
Figures 5.1 and 5.2 depict the effect of simplification achieved by 
this data structure mapping. 
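
A minimal Python sketch of the simplified scheme follows. Program (3) is not reproduced in this section, so the sketch assumes only what the cell programs above show: a three-point average with fixed boundary values lb and rb held at positions 0 and n+1. The function name `relax` and its parameters are hypothetical. Each cell keeps its own value, receives only its two neighbours' values, and the controlling cell n+1 merely decides whether another sweep is needed.

```python
def relax(x_interior, lb, rb, kmax):
    n = len(x_interior)
    # local[i] is the value stored in cell i; positions 0 and n+1 hold the boundaries.
    local = [lb] + list(x_interior) + [rb]
    for _ in range(kmax):       # cell n+1 only signals whether to continue
        # Each cell i receives the stored values of its two neighbours ...
        received = {i: (local[i - 1], local[i + 1]) for i in range(1, n + 1)}
        # ... and computes its new t from them and from its own stored value.
        new_t = {i: (received[i][0] + local[i] + received[i][1]) / 3
                 for i in range(1, n + 1)}
        for i, t in new_t.items():
            local[i] = t        # stored locally for the next calculation of t
    return local[1:n + 1]


print(relax([0.0, 0.0], 3.0, 3.0, 1))
```

No complete copy of x is ever assembled or broadcast in this version; the only per-sweep traffic is the pair of neighbour values plus the continue signal.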


In order to achieve the simplification suggested by Figure 5.2, we 
have to be able to determine the cell where a particular element of 
a structure is generated. This can be done easily in program (3) 
once we note that no element of y in the inner loop is ever 
redefined; that is, y is an I-structure. As noted earlier, an element 
belonging to an I-structure can be distributed as soon as it is 
generated. 


For the kind of programs we are interested in, the selector for 
the append operation is often directly and simply related to the 
loop index. If the loop index is taken from an unrepeated set, the 
condition of I-structures is automatically met. Now we give a rule 
for determining the number of the cell where the element of a data 
structure is generated in such cases. 


Suppose we want to find the cell number where element c[i,j] is 
defined in program (4). First, find the cell that appends a value on 
selector i of c. Let k0 be the cell. Then 


c_{k0} ← append(c_{k1}, i, d_{k2}) 


where subscripts k0, k1 and k2 refer to cell numbers. Once k2 is 
known, find the cell that appends a value on selector j of d_{k2}. Let k3 
be such a cell. Then 


d_{k3} ← append(d_{k4}, j, v) 


Mapping c[i,j] onto cell k will mean that cell k3 will send value v to 
cell k. Suppose we consider the cell definitions of Figure 3. Then 
mapping c[i,j] onto cell number <i,j> results in all the append 
operations being eliminated. Cell <i,j> would compute a value s 
and hold on to it. On the other hand, mapping c[i,j] on cell <i,j+1> 
would result in cell <i,j> sending the value s to cell <i,j+1>. 


It is useful in a large program to map a data structure according 
to how it is used rather than how it is created. When matrix 
multiply is part of a larger program, one will have to take into 
account the cells where matrix c will be used to specify efficient 
mappings. A common situation is that of unnested loops where 
one loop produces a structure and the other loop uses it. In such 
cases a cell definition may include one iteration of each loop. 


5 Mapping Computation cells on processors: 


In general one expects the network of computation cells to be 
larger than the number of processors available. The mapping in 
such cases takes the form of specifying a folding of the network of 
cells to fit the machine. Suppose we want to map the network of 
Figure 2 onto a p processor machine when l >> p. For the 
interconnections of Figure 2 we consider 3 mappings: 


1. Map cell i on processor number i mod p (see Figure 
6.1). 


2. Map cells 0 to p-1 on processors 1 to p. Map cells p to 
2p-1 on processors p to 1, and so on (see Figure 6.2). 


3. Map cells 0 to f-1 on processor number 1, cells f to 2f-1 
on processor number 2, and so on, where f = 
⌈(l+2)/p⌉ (see Figure 6.3). 


If the p processors are connected by a ring bus there may be no 
reason to choose between mappings 1 and 2. However the first two 
mappings are clearly inferior to the third mapping if the p 
processors have any kind of locality in their interconnection. 


This small example only illustrates that a reasonable mapping of 
a network of cells onto an actual machine can be derived by simple 
reasoning. In fact we expect to do such simple folding of networks 
automatically. 
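
The three foldings can be written down directly. The Python sketch below uses hypothetical function names and keeps the processor numbering used in the text (0 to p-1 for mapping 1, 1 to p for mappings 2 and 3). It assumes the network of Figure 2 has cells 0 to l+1, which is why l+2 appears in the block size f.

```python
from math import ceil


def map_cyclic(i, p):
    """Mapping 1: cell i goes to processor i mod p (processors numbered 0..p-1)."""
    return i % p


def map_snake(i, p):
    """Mapping 2: cells 0..p-1 go to processors 1..p, cells p..2p-1 to p..1, and so on."""
    sweep, offset = divmod(i, p)
    return offset + 1 if sweep % 2 == 0 else p - offset


def map_block(i, p, l):
    """Mapping 3: f consecutive cells per processor, with f = ceil((l + 2) / p)."""
    f = ceil((l + 2) / p)
    return i // f + 1


l, p = 10, 4                    # cells 0..l+1, i.e. 12 cells folded onto 4 processors
cells = range(l + 2)
print([map_cyclic(i, p) for i in cells])
print([map_snake(i, p) for i in cells])
print([map_block(i, p, l) for i in cells])
```

Under the block mapping, neighbouring cells i and i+1 land on the same processor except at block boundaries, which is the locality argument made above for preferring mapping 3.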


6 Conclusions: 


The success of large multiple processor machines crucially 
depends upon their programmability. Flexibility in programming 
such machines is ultimately limited by our ability to decompose a 
program into smaller programs suitable for execution on one 
processor. In the past, decomposition efforts have had limited 
success due to Fortran being the source language. It is suggested in 
this paper that applicative languages with restrictions on data and 
control structures are far more amenable to decomposition. It is 
generally quite easy to write a maximally parallel applicative 
program for a given algorithm. Undoubtedly the problem of 
decomposing a maximally parallel program is far simpler than 
detecting parts of a sequential program that are suitable for 
concurrent execution. 


A strategy for decomposing applicative programs for a multiple 
processor machine has been outlined. It creates a network of 
computation cells without relying on any information about the 
topology or the number of processors in the actual machine. The 
network is mapped onto the actual machine as the last step in the 
procedure. Our research efforts for the time being are concentrated 
on deriving efficient cell programs for the network of computation 
cells. 


Figure 2 Computation cell network when loop i is 
unfolded in the matrix multiply procedure 


Figure 3 Computation cell network when loop i and loop j 
are unfolded in the matrix multiply procedure 


Figure 4 Computation cell network when inner loop i is 
unfolded in the relaxation program (3) 


Figure 5.1 Mapping x[i] on cell i by distributing x in the 
network of Figure 4 (solid lines show the reference pattern, dashed lines show the mapping) 


Figure 5.2 Mapping x[i] on cell i by sending it from the cell 
that generates x[i]. Since cell i generates x[i], no 
dashed line with value x[i] is shown. 


Figure 6.3 Mapping (3) cells 0 to f-1 on processor 1, 


Automatic Exploitation of Parallelism on a 
Homogenous Asynchronous Multiprocessor 


Thomas L. Rodeheffer and Peter G. Hibbard 
Department of Computer Science 
Carnegie-Mellon University 
Pittsburgh, Pennsylvania, 15213 


Summary 


This paper describes an investigation which is starting into 
the practical issues of automatically detecting 
parallelism in ordinary programs and exploiting it on a 
multiprocessor. We are looking at user programs written in 
Fortran and our target multiprocessor is Cm*, a distributed 
multiprocessor designed and built at Carnegie-Mellon 
University. 


We have chosen Fortran for the following 
reasons: we have a modern Fortran compiler written in C 
which is accessible and easy to modify; Fortran has a 
simpler implementation than other commonly-used high-level 
languages; much previous work has been concerned with 
analyzing Fortran programs, thereby allowing our results to 
be more easily compared with the results of others; and 
finally, Fortran is a language of much practical interest to 
the scientific computing community. 


We have chosen Cm* [1] as our target multiprocessor 
primarily because it is a part of our research environment. 
Since Cm* is the subject of several projects, our work can 
enhance other research. Operating roughly as a classic, 
shared-memory multiprocessor with fifty identical, asynchronous 
processors, Cm* has the advantage that its memory 
accessing mechanism is implemented by a hierarchical 
switching network whose nodes can be microprogrammed to 
provide special operations in addition to simple memory 
mapping. Finally, Medusa [2], an operating system which 
supports a Unix-like environment but still allows almost a full 
exploitation of the Cm* hardware, has recently become 
available. 

Measurements on Cm* [3] indicate that speedups near 
the theoretical limit are attainable for programs which have 
been carefully designed to take advantage of the available 
parallelism. Unfortunately, sufficiently careful and ingenious 
design has not proved to be a simple matter. Programming 
a multiprocessor is a difficult and tedious task, especially at 
the detailed levels of inter-processor coordination. Furthermore, 
multiprocessors are not generally available and thus 
much work has tended to be theoretical in nature. 


Previous work on automatically detecting and exploiting 
parallelism has been directed primarily at architectures other 
than those of asynchronous multiprocessors. Kuck et al. [4] 
have studied extensively the ways in which programs can be 
transformed to extract parallelism under the assumption that 
the target architecture consisted of synchronous processors 
which perform exactly one operation every time step. Allan 
and Oldehoeft [5] have considered the same problems with 
a data-flow machine as a target architecture. In both cases, 
no consideration need be given to the problems of commu- 
nication and synchronization between the processing ele- 
ments, because such problems are assumed to be solved by 
the architecture at no cost. Gonzalez and Ramamoorthy [6] 
have studied through simulation the problems of scheduling 
on a multiprocessor parallel tasks of a program at the 
statement level. 


We view the exploitation of parallelism as an optimiza- 
tion technique which is useful on a multiprocessor architec- 
ture. We are interested in automatically taking advantage of 
low-level parallelism—parallelism which would be difficult or 
just too tedious for a programmer to specify but which can 
be detected on a fairly local basis. The more global 
problem of designing a program or algorithm specifically to 
use parallelism falls beyond the scope of what we consider 
automatic optimization techniques. 


We are building a prototype system which compiles 
Fortran programs into machine code for Cm*, detecting 
implicit, low-level parallelism and generating a schedule of 
tasks to minimize the time-to-completion of the program. All 
detection and scheduling is done during compilation. The 
run-time system on Cm* provides inter-processor communication 
and synchronization primitives which the compiler 
uses to effect its schedule. 


For our purposes, each of the individual processors in 
Cm* contains a copy of the same code and shares access 
to the same data locations. As explained in [2], such an 
arrangement is not the most effective use of Cm*, but its 
simplicity and similarity to the normal manner of use of 
tightly-coupled multiprocessors are appealing. This arrangement 
is essentially the same as presumed in [6]. 


The compiler processes the Fortran source program on 
a subroutine-by-subroutine basis. Each subroutine is 
compiled into a directed graph of actions, in which each 
action represents an operation at the level of the individual 
operations of expression evaluation, and each edge repre- 
sents a data- or control-flow dependency. The compiler 
then analyzes the flow graph to determine an execution 
schedule for the actions of the program. 
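
To make the idea of a directed graph of actions concrete, here is a small Python sketch. The paper does not give the compiler's actual data structures, so the dictionary representation, the action names and the function `parallel_levels` are all assumptions for illustration. The sketch groups actions into levels such that every action in a level depends only on actions in earlier levels; actions within one level are candidates for parallel execution.

```python
def parallel_levels(deps):
    """Group actions into levels; each action depends only on actions in earlier levels.

    deps maps each action to the set of actions it depends on.
    """
    remaining = {a: set(d) for a, d in deps.items()}
    levels = []
    while remaining:
        ready = sorted(a for a, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle")
        levels.append(ready)
        for a in ready:
            del remaining[a]
        for d in remaining.values():
            d.difference_update(ready)
    return levels


# Hypothetical actions for the expression (a + b) * (c - d), at the operation level.
deps = {
    "load a": set(), "load b": set(), "load c": set(), "load d": set(),
    "add": {"load a", "load b"},
    "sub": {"load c", "load d"},
    "mul": {"add", "sub"},
}
levels = parallel_levels(deps)
print(levels)
```

A real scheduler would also weigh execution times and coordination costs; this sketch shows only the dependency structure that such a schedule must respect.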


The compiler uses approximate execution times for the 
various machine instructions and run-time system primitives 
in order to transform the flow graph to reduce the estimated 
time-to-completion of the final object code program. For 
example, a sequence of actions each of which is dependent 
solely upon its predecessor is probably best executed as a 
single task with no internal scheduling actions. Even 
actions that could be performed in parallel probably ought 
to be executed sequentially without scheduling if the 
overheads required of the run-time system to coordinate 
another processor are too large relative to the time that 
could be saved by parallel execution. These are only the 
simplest transformations, however. 


Kuck et al. [4] have developed transformations applicable 
to assignment statements and common forms of 
Fortran DO-loops which exploit parallelism to reduce total 
execution time. Although the transformations were designed 
for a synchronous multiprocessor architecture such as an 
array machine, with proper consideration of inter-processor 
coordination costs it seems that these transformations could 
be useful in the environment of an asynchronous 
multiprocessor as well. 


Another important class of transformations are those 
which act to defer or distribute overheads so that work is 
removed from the critical, limiting path of the computation. 
For example, instead of creating a new task at some point in 
the program (which involves the run-time overhead of 
locating a free processor and communicating the task start 
address to it) the compiler may be able to identify an earlier 
task whose completion had recently been awaited and 
arrange to re-use that task by passing signals for 
synchronization; this assumes that the task creation and completion 
primitives are more expensive than a signal between two 
existing tasks. 
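
The trade-off between coordination overhead and parallel gain can be expressed as a simple comparison of estimates. The Python sketch below is only an illustration: the function name, the cost model (overhead plus the longer of the two actions) and all numbers are assumptions, not measurements of Cm* or the compiler's actual rule.

```python
def worth_forking(t_a, t_b, fork_overhead):
    """Return True if running actions a and b in parallel is estimated to be faster.

    t_a, t_b      estimated execution times of the two independent actions
    fork_overhead estimated cost of coordinating a second processor
    """
    sequential = t_a + t_b
    parallel = fork_overhead + max(t_a, t_b)
    return parallel < sequential


# Hypothetical time units: short actions are not worth the coordination cost.
print(worth_forking(10, 10, 50))     # overhead dominates
print(worth_forking(100, 100, 50))   # parallel execution pays off
```

The same comparison applies to the task re-use transformation: replacing `fork_overhead` with the smaller cost of a signal to an existing task lowers the threshold at which parallel execution becomes worthwhile.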


Our goal is to demonstrate a workable system for 
exploiting low-level parallelism on a multiprocessor. We are 
encouraged by previous results [7] which indicate that 
substantial low-level data parallelism is in fact available, 
although in that implementation the language run-time 
support was so complex that performing all analysis at run-time 
was feasible. Now we direct our attention to a 
language of much simpler requirements in order to address 
the practical issues of a workable system. 


A CONTROLLABLE MIMD ARCHITECTURE 


by 


(a) 


Stephen F. Lundstrom 
George H. Barnes 
Burroughs Corporation 
Paoli, PA 19301 


Abstract -- A MIMD architecture targeted at 
1000 Mflop/sec has been described to NASA. This 
system is targeted to be the Flow Model Processor 
(FMP) in the Numerical Aerodynamic Simulator. 
This paper describes the strategies adopted for 
making a many-processor multiprocessor 
controllable and efficient, primarily by 
decisions that are made at compile time. 
Hardware features include the division of memory 
into space private to each processor and space 
shared by all, and a hardware synchronization of 
all processors. The connection network, 
connecting 512 processors to 521 memory modules, 
is an essential element. 


Two main constructs are needed in the 
language to control the architecture. First, an 
expression that a number of instances of a given 
section of code can be executed concurrently, and 
second, a determination as to whether variables 
are local to the instance or global to the entire 
program. 


Performance validations used whole programs, 
not kernels. Simulation and analysis combine to 
demonstrate achievement of the goal of 1000 
Mflop/sec on suitable programs and good 
performance on others. 

Introduction 


Present generation very-high-speed computers 
generally exploit vector algorithms for their 
highest performance. A study for NASA Ames 
Research Center was conducted to determine the 
feasibility of a "Flow Model Processor" (FMP) 
which could achieve a sustained computational 
rate of one billion floating point operations per 
second on complete aerodynamics flow programs 
[1]. It concluded that the dependence on vector 
operations for high throughput was not 
necessary. 


Given that device technology has been fully 
utilized, parallelism can be used to achieve 
performance beyond that possible with a 
uniprocessor. Historically, two approaches have 
been used to achieve parallelism: a pipeline 
where parallelism is achieved by each stage of 
the pipeline operating on a different step of 
successive operation, or an array of identical 
execution units each simultaneously evaluating 
the same instruction on different data. 
References [2,3,4, and 5] have recent examples of 
both. In either case the result is a vector machine 
where the data comes from orderly addresses 
in memory and the same instruction acts on each 
data element. 


(a) this work was done for NASA under Contract 
NAS2-9897 and reported to them in [1]. 


The Flow Model Processor makes use of the 
parallelism of a MIMD (multiple instruction 
stream, multiple data stream) architecture. The 
architecture includes specific features so that a 
single program can be issued to all the 
processors and result in cooperative execution on 
a single application for a single user. 


This paper describes motivations behind the 
design and some of the strategies used to ensure 
controllability. The architecture described here 
avoids or sidesteps the limitation observed in 
some MIMD architectures which are unable to 
utilize more than a few processors effectively. 
The result is an architecture that is somewhat 
specialized to a class of applications (although 
much less specialized than a vector machine would 
be). This architecture exploits any concurrency 
inherent in the problem, whether or not that 
concurrency can be described as vector 
operations. 


The problem was approached by first studying 
the aerodynamic applications [6]. These 
applications have a large numerical component, 
much inherent concurrency, and simple control 
structures. Due to the wide variation in the 
amount of computation that may concurrently 
proceed between times at which synchronization is 
required, efficient implementation of the 
synchronization function is required. Due to the 
many different natural modes of accessing data, a 
large memory equally accessible to all processors 
is required. Due to the practical limitation on 
the speed attainable in a large common memory, 
and due to the need for speed, an architecture is 
required which allows many memory accesses to be 
from memory local to each processor. 


Software strategy is based on the premise 
that source text submitted to the compiler should 
result in a single program being compiled for all 
processors in the array which will then execute 
it cooperatively. This premise is also advocated 
in [7]. From another point of view, the compiler 
emits a single program which is to be executed 
independently by each of the processors in the 
array. Included in this program are instructions 
which cause the processors to cooperate by 
sharing data and by synchronizing their actions 
appropriately when needed. 


A second element of the strategy is to make 
certain decisions at compile time instead of run 
time. These decisions can then be supported by 
efficient hardware mechanisms, not by system 
software. 

The functional constructs on which a 
language for this architecture is to be based can 
be compared to discussions previously found in 
the literature. A general discussion of parallel 
languages is found in [8]. Some proposed 
parallel languages are directed at the vector 
type of architecture, as in [9,10,11,12]; others 
are not [13,14,15]. Some workers have proposed 
that the requisite parallelism can be found by 
starting from algorithms expressed in serial form 
[16,17] so that standard Fortran can be mapped 
onto various parallel architectures without 
language extensions. In the present case the 
architecture is such that the operations which 
can be done independently of each other and in 
parallel are whole sections of code, not 
restricted to single operations. 


We believe that the architecture proposed 
here has several advantages over other parallel 
architectures previously proposed and that the 
simulations and performance validations reported 
below uphold this view. While no single feature 
of this architecture is by itself new, we believe 
the combination of features is. Some previously 
proposed architectures have all memory shared 
among all processors [18, 19, 20, 21] but 
without processor private memory for data. In 
some cases, a central control processor is 
involved with the control of interconnections 
between processors, or from processors to memory 
[22]. No such centralized control is required 
here during execution of user programs. To our 
knowledge, fast hardware synchronization as seen 
here has not been proposed for MIMD 
architectures, although any SIMD machine, such as 
in [3], will be synchronized. 


The development of the system concepts 
evolved from the applications to system 
architecture (involving both hardware and 
software) to a more detailed definition of both 
the hardware and software. In order to simplify 
the introduction of the software concepts, they 
will be preceded by a short summary of the 
hardware architecture. Following the software 
concept summary, a more detailed description of 
some parts of the hardware will be provided. 


Hardware Overview 


The block diagram of the proposed 
multiprocessor is shown in Figure 1. Salient 
features of this hardware are: 

* A prime number of memory modules to reduce 
memory access conflicts. 

* Separation of the memory space seen from 
each processor into a private part, and a section 
shared among all processors. 

* A connection network whereby all processors 
can simultaneously request access to various 
memory modules. 

* Hardware synchronization, a P-way AND of the 
signal from each processor that marks its having 
gotten to a specific point in the program. 


Each of the 512 processors has its own 
program counter, its own local memory for program 
and data, and its own connection to a shared 
memory. The shared memory is built of many (521) 
independently accessible modules. In order to 
provide connectivity between the processors and 
the memory modules, a connection network which 
has a complexity of O(P log(P)), instead of the 
O(P^2) complexity expected for a fully general 
cross-point network, was chosen. This choice 
satisfies both the economic requirements and the 
bandwidth requirements of the system. For 
discussion of the connection network, see [23]. 
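
The benefit of a prime number of memory modules can be illustrated with a short Python sketch. It assumes, purely for illustration, that a shared address is assigned to module (address mod number of modules) and that processor q accesses address q * stride; neither the interleaving rule nor the function name `modules_hit` comes from the paper.

```python
def modules_hit(stride, processors, modules):
    """Count distinct modules touched when processor q accesses address q * stride."""
    return len({(q * stride) % modules for q in range(processors)})


# Compare 512 modules (a power of two) with 521 modules (a prime).
for stride in (1, 2, 512):
    print(stride, modules_hit(stride, 512, 512), modules_hit(stride, 512, 521))
```

Under this assumption, a power-of-two stride makes many processors collide on the same few modules when the module count is 512, whereas with 521 modules any stride that is not a multiple of 521 spreads the 512 requests over 512 distinct modules.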


Software 


The expense involved in application software 
development and maintenance over the life of a 
system now often exceeds the total costs of 
operations support and acquisition/amortization 
of the computational equipment, especially in 
development environments. The development of any 
new capabilities for such environments must, 
therefore, carefully consider both efficient 
utilization of the computational facility and the 
efficiency with which application development can 
proceed. In the past, unfortunately, the 
emphasis has been almost entirely on efficient 
hardware utilization. The provision of 
capabilities to embed assembly or machine code 
within high-level languages such as FORTRAN is 
an example of this approach. One recently 
introduced extended FORTRAN supports both 
development, with application-oriented vector 
forms, and efficient hardware utilization [12]. 


The major concern during the study was the 
feasibility of a hardware system with the 
required sustained performance. Automatic 
conversion of standard FORTRAN was not required. 
Rather, the project emphasized the definition of 
FORTRAN extensions that provided efficient 
control of the hardware capabilities and ease in 
application definition. 


Language Overview 


The basic language construct chosen for this 
MIMD system was one of computational processes 
that proceed concurrently between appropriate 
synchronization points. This type of construct 
is clearly compatible with a MIMD system. Such a 
construct is also convenient for application 
descriptions in that it is more general than the 
vector forms currently in use. The concurrent 
processes may include boundary value computations 
and central value computations simultaneously. 
Thus, each program for the FMP has a structure of 
pieces of normal serial code, which describe the 
details of what must be done at a given time, or 
at a given element of some index set, embedded in 
a control structure that expresses the location 
of concurrency and where the synchronization must 
occur. 


Three extensions to standard FORTRAN are 
proposed. The primary extension is the construct 
described above which allows the definition of 
the inherent concurrency in a process. This 
construct is called "DOALL". The second 
extension is a construct to allow the definition 
of index sets, called "DOMAIN"s. The third 
extension is a means for identifying the data or 
variable dependencies between the instances of 
various processes and for differentiating which 
variables or data are independent of the global 
process structure and are therefore local to a 
particular instance. 


Domains 


A means for describing index sets to the 
compiler is needed. In FMP FORTRAN such sets are 
called DOMAINs. A DOMAIN has an associated name 
and can be interpreted as a one or multi- 
dimensional index set. For example, the 
declaration 

    DOMAIN/EYEJAY/: I=1, IMAX; J=1, JMAX 

declares that there are IMAX*JMAX elements, each 
consisting of one pair of values of I and J, with 
values in the range shown. Standard set 
operators are allowed. For example, if one has 
also declared 

    DOMAIN/KAY/: K=1, KMAX 

then the cartesian product 

    DOMAIN/IJK/: EYEJAY .X. KAY 

defines a three-dimensional domain with extents 
in each dimension of IMAX, JMAX, and KMAX. 
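
As an informal model of these declarations, the following Python sketch represents a DOMAIN as a tuple of index names plus the list of index tuples. The helper names `domain` and `cartesian`, and the small extents, are assumptions for illustration only; FMP FORTRAN itself is not shown here beyond the declarations quoted above.

```python
from itertools import product


def domain(**extents):
    """Index set for 1-based ranges, e.g. domain(I=3, J=2) for I=1,3; J=1,2."""
    names = tuple(extents)
    ranges = [range(1, hi + 1) for hi in extents.values()]
    return names, list(product(*ranges))


def cartesian(d1, d2):
    """Rough counterpart of .X.: every element of d1 paired with every element of d2."""
    names1, elems1 = d1
    names2, elems2 = d2
    return names1 + names2, [e1 + e2 for e1 in elems1 for e2 in elems2]


IMAX, JMAX, KMAX = 3, 2, 4          # small illustrative extents
eyejay = domain(I=IMAX, J=JMAX)     # models DOMAIN/EYEJAY/
kay = domain(K=KMAX)                # models DOMAIN/KAY/
ijk = cartesian(eyejay, kay)        # models DOMAIN/IJK/: EYEJAY .X. KAY
print(len(eyejay[1]), len(ijk[1]))
```

The element counts follow the text: EYEJAY has IMAX*JMAX elements and IJK has IMAX*JMAX*KMAX.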


In the aero flow applications, only 
rectangular domains such as the example "IJK" 
were seen. Extensions to the domain concept will 
be needed for other applications. Simple 
modifications to domains can be implemented by 
conditional statements within the doall program 
segment. 


In addition to their use in specifying the 
index sets for doalls as explained in the next 
section, domains can substitute for the iteration 
index sets in do loops, and for dimensionality in 
the declaration of arrays. 


One convenient use of the DOMAIN construct 
is for the description of the geometry (or 
computational limits) of the problem. By naming 
the controlling index set, and referring to the 
index set by name throughout the rest of the 
program, changes relating to geometry need be 
made in only one place in the program. 


DOALL Construct 


The DOALL construct is the FMP FORTRAN 
extension for describing the inherent concurrency 
in a process. Figure 2 shows the conceptual flow 
of execution in this construct. Once the 
construct is entered, all individual parts may 
proceed simultaneously dependent on the availabi- 
lity of resources. Control is not allowed to 
pass beyond the construct until all individual 
parts (called instances) have completed whatever 
computation they are to do. 


The doall construct consists of a "DOALL" 
header, followed by a doall program segment 
followed by a doall terminating delimiter. The 
header will contain a specification of a domain, 
perhaps by name. If the domain in the header is
the domain "EYEJAY" as declared in the example of
the previous section, and IMAX = 100 and JMAX =
50, then there are 5000 instances of the doall
program segment to be executed. Each instance of
the doall program segment can execute indepen-
dently of, and without any interaction with,
every other instance of the doall program
segment. Within each instance, there may not be
any references to computations within any other
instance, but no restrictions on references to
"old" values exist. The computation within each
instance may be conditional on location in the 
model, on data, or on any other condition. 


Hardware Support of the DOALL Construct 


An issue is the mapping of the DOALL 
construct onto real processor resources. A DOALL 
construct execution begins when processors 0
through 511 pick up instance numbers 0 through
511. For a DOALL with I and J for instance
variables as in the example above, each processor
computes I and J values from the instance number
by solving the equation

instance number = J*IMAX + I

Specifically, I is instance number modulo IMAX
and J is instance number DIV IMAX. When each
processor has finished its instance of the DOALL
program segment, it increments instance number
by 512, computes new I and J values, and proceeds
to iterate thus until the I and J values computed
are outside the domain. Once the processor has
completed all assigned instances, it drops down
to a "WAIT" instruction. When all processors get
to "WAIT", a 512-way AND of the WAITing state is
used to create a "go" signal which causes all
processors to step to the next construct or
instruction. Thus, an essential feature to make
the DOALL construct work is a fast hardware
synchronization operation. DOALL program seg-
ments can be as short as a single statement. A
single-statement DOALL with regular subscripting
on variables exactly corresponds to a vector
operation in a vector machine and hence this MIMD
architecture includes vector computations as a
subset of its capabilities.
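
The assignment of instances to processors described above can be sketched as follows. This is a Python illustration, not FMP code; it assumes zero-based I and J (as the equation instance number = J*IMAX + I implies), and the function name `instances_for` is hypothetical.

```python
NUM_PROCESSORS = 512

def instances_for(processor, imax, jmax, nproc=NUM_PROCESSORS):
    """Yield the (I, J) pairs one processor computes in a DOALL.

    The processor starts at the instance number equal to its own number
    and steps by the processor count until (I, J) leaves the domain,
    at which point it would drop to the WAIT instruction.
    """
    instance = processor
    while True:
        i = instance % imax      # I = instance number modulo IMAX
        j = instance // imax     # J = instance number DIV IMAX
        if j >= jmax:            # outside the domain: nothing left to do
            return
        yield i, j
        instance += nproc
```

For the 100 by 50 example in the text, every one of the 5000 instances is visited exactly once, and each processor receives either 9 or 10 instances.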

Waiting implies processor idle time. In the 
aerodynamic flow and weather codes which were 
analyzed during the study, the amount of 
processing per processor was nearly equal for all 
processors, and hence processor efficiency was 
high, the first processor to finish being only 
slightly ahead of the last. 


Memory Allocation 


System control is simplified by making
decisions at compile time rather than having them
made by system software at run time. The
distinction between the various sorts of memory
is made in the compiler with help from programmer
declarations.


The potential four types of memory
allocation are:




1. A variable or array element is visible to any 
part of the program, can be accessed from within 
any instance of a doall program segment, or from 
any serial section of code between doall program
segments. 


2. A variable is a temporary variable which need 
not remain defined after the end of the instance 
in which it is used.


3. A variable is so frequently accessed that 
each processor deserves to have its own local 
copy. 

4. A one-to-one relationship between the
elements of an array and the elements of a domain
holds. Within the instances of a doall program
segment over that domain, elements of that array
are accessed in correspondence to the
relationship.

The exact form of the declarations for
helping the compiler make appropriate assignments
of different data to different types of space is
still under discussion. It is clear that some
analysis on the compiler's part is possible; an
array which is subscripted with the instance
variables inside a doall must be either type 1 or
type 4, for example. If the language is to be an
extended Fortran, each common area must contain
variables of only one category.


The sets of memory declarations suggested to
date contain some common features. First, there
is a declaration to the effect that a variable is
shared (type 1). Second, there is a declaration
(or default) that a variable is temporary to the
instance (type 2). Third, there is a means for
declaring that a set of variables is of type 4.
This last is the "INALL" declaration. The INALL 
declaration couples a variable or array with the 


dimensionality and index set of a domain. For 
example, the declaration 

INALL/EYEJAY/ C1, C2, A(5)
declares that there is an element of C1, an
element of C2 and five elements of A associated
with each element of the domain "EYEJAY". When
there is a doall construct over the domain
"EYEJAY (I,J)" then these variables can be used
within the doall program segment and each instance
will have its own copy. Referring to a variable
such as C2 either without subscripts, or with
"centered subscripts", i.e., C2(I,J), is
permissible and functionally identical. Outside
of doalls over "EYEJAY", these three identi-
fiers will identify arrays which have
dimensionality

C1(IMAX,JMAX), C2(IMAX,JMAX), and A(IMAX,JMAX,5)
respectively.


Given that there are two kinds of memory
space, memory private to each processor and
memory shared by all processors, variables of 
type 2 and type 3 will be found in processor 
private memory, and type 1 would be in shared 
memory. If a variable of type 4 is only accessed 
within doalls over the appropriate domain, and 
always on centered subscripts, it can be held in 
the private memories of the processor that will 
compute the instances that are in one-to-one 
correspondence with the appropriate array 
elements. 


Parallel Functions 


Some common parallel operations and first- 
order linear recurrences would be supported by 
new intrinsics. 


Parallel sum. Consider a variable defined 
within each instance at the end of a doall. The 
parallel sum of all those variables is created, 
which will then be accessible after the end of 


the doall. 512 such variables can be summed in 9
steps using interprocessor communication.
Similar parallel functions are parallel AND, 


parallel OR, and MAXIMUM across all instances. 
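
The step count quoted for the parallel sum follows from a pairwise (tree) reduction. The sketch below is a serial Python model of that idea, under the assumption that all additions within one pass could be done at the same time on separate processors; the name `parallel_sum` is illustrative and the input is assumed non-empty.

```python
def parallel_sum(values):
    """Pairwise (tree) reduction; returns (total, number_of_steps)."""
    vals = list(values)
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        # All additions in this pass are independent of one another,
        # so a parallel machine could perform them in a single step.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps
```

With 512 inputs the loop makes 9 passes, matching the figure in the text; replacing the addition with AND, OR, or a maximum gives the other parallel functions mentioned.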


First-order linear recurrence. Given
quantities B(I) and C(I) in each instance of a
doall whose index set is I=1, IMAX, form the
sequence A(I) = A(I-1)*B(I) + C(I). A(0) is
given as an initial value. As with the parallel
sum, this function can be implemented in N steps
when IMAX = 2^N [24].
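
One way to see why a first-order linear recurrence needs only N steps for 2^N terms is recursive doubling over the pairs (B, C), each read as the map a -> a*B + C. The Python sketch below illustrates that general technique; it is not taken from reference [24], and the function name is an assumption.

```python
def linear_recurrence(b, c, a0):
    """Return ([A(1), ..., A(n)], steps) for A(i) = A(i-1)*B(i) + C(i).

    pairs[i] = (B, C) stands for the map a -> a*B + C. Composing maps that
    lie d apart doubles the reach of every pair on each pass.
    """
    n = len(b)
    pairs = list(zip(b, c))
    steps = 0
    d = 1
    while d < n:
        new = list(pairs)
        for i in range(d, n):            # independent updates: one parallel step
            b1, c1 = pairs[i - d]        # earlier map, applied first
            b2, c2 = pairs[i]            # later map, applied second
            new[i] = (b1 * b2, c1 * b2 + c2)
        pairs = new
        d *= 2
        steps += 1
    # pairs[i] now maps A(0) directly to A(i+1)
    return [a0 * bb + cc for bb, cc in pairs], steps
```

For 8 terms the loop runs 3 passes, and the result agrees with evaluating the recurrence one term at a time.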


Other Software Issues 


Although the mechanisms shown demonstrate 
that one can design a language to enhance control
of the MIMD machine by imposing structure and 
regularity on the MIMD interprocessor 
interactions at compile time, there are certain 


issues which have to be resolved before fixing on
a final design for the language. 


One issue is a trade between making memory 
allocation decisions based on programmer declar- 


ations and making allocation decisions by 
compiler analysis. Many users of high-throughput 
machines insist on being able to control every 


detail of machine action, out of fear that the 
vendor's compiler will be inefficient if left to 
its own devices. 


Using Fortran as a starting point raises an 
issue that might not arise with some other 
starting point because of the requirement in 
Fortran for separate compilation. At compile 
time the compiler must distinguish between a 
subroutine called within a doall program segment 
where each instance of the doall calls its own
copy, and a subroutine called outside the doall 
which runs on the array as a whole. The simplest 
solution would be to distinguish between the two 
kinds of subroutine by a difference in the 
SUBROUTINE statements. 


"Every instance of the doall program segment 
must be independent of and free from any side 
effects that would interfere with any other 
instance of the same doall program segment". 
This over-simplified statement is true at the 
first level of understanding of the working of 
the machine. However, steps taken to enforce 
this rule are subject to a trade between authori- 
tarian and libertarian schools of programming. 
There is no hardware limitation on the processors 
fetching or storing any variable in shared memory 
at any point in the program. Since the relative 
timing between actions that occur in different 
instances of the doall is not controlled, this 
allows for data accesses and definitions to occur 
in an uncontrolled order. Hence there is a
question about the enforcement of data prece-
dence. Absolute enforcement by the compiler, so
that code which is emitted is guaranteed to be 
free of data precedence violations, may be 
undesirable. First, such a compiler will be 
unable to detect all cases in which the instances 
are independent of each other and as a result 
will forbid certain useful functions. Second,
for some applications [25] a change in the 
sequence of performing the computations will 
change the result to another, different, but 
still acceptable result. One does not wish to 
forbid such programs. However, if the compiler 
made no check, gave the user no help, unnecessary 
errors might be committed. The following rule is 
observed to cover all cases that arise in the 
aero flow and weather codes, and appears simple 
to implement. "If an array element in shared 
memory is used on the right side of an assignment 
statement within a doall program segment then any 
assignment to that array in the same doall
program segment must be on centered subscripts 
and will be held in a "new" copy of the array. 
The "new" copy will replace the old copy of the 
array at the time of synchronization at the end 
of the doall." 
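
The effect of the quoted rule can be illustrated with a small Python model (an assumption-laden sketch, not FMP code): every instance reads only the old copy of the array and writes its own centered element of a new copy, so the order in which instances run cannot change the result. The function name `doall_neighbor_sum` and the particular computation are hypothetical.

```python
def doall_neighbor_sum(old, order):
    """Model one doall over the interior points of a 1-D shared array.

    `order` lists the interior indices in the order the instances happen
    to run; under the rule it has no influence on the outcome.
    """
    new = list(old)                       # the "new" copy of the array
    for i in order:
        # Right-hand side uses only old values; the assignment is on the
        # centered subscript i of the new copy.
        new[i] = old[i - 1] + old[i + 1]
    return new                            # replaces the old copy at synchronization

data = [3, 1, 4, 1, 5, 9]
forward = doall_neighbor_sum(data, [1, 2, 3, 4])
scrambled = doall_neighbor_sum(data, [3, 1, 4, 2])
```

Both calls give the same array, which is the property the rule is meant to guarantee.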

Hardware Details


Instead of implementation details, discus- 
sion below will concentrate on how hardware 
features support the language extensions.


Processor 


Analysis of the aerodynamic flow and global 
weather model programs (provided by NASA Ames 
during the NASF Feasibility Study as samples of 
typical application programs) showed that up to
several thousand processors could efficiently 
work in parallel. In these cases, the actual 
number of processors supplied is irrelevant over 
a large range; only total throughput matters. 
The design intent was to supply a processor that 
had maximum throughput at minimum cost. The 
trade-off evaluation was based on assumptions of 
the technology suitable for 1983 delivery and on 
the desire to limit complexity to control project 
risk. The result was 512 processors, each having 
capability of about 3 Mflop/sec.


Each processor has independent integer and 
floating-point execution units with limited 
instruction look-ahead. To hide access time of 


the shared memory, each processor has a one-slot 


queue, called the "CN Buffer", which manages 
accesses to the shared memory while other 
processor operations go on concurrently. A
processor-local memory of about 32K words is
appropriate to the applications studied.


Shared Memory


In reference (1), the shared memory is 
called “Extended Memory" (EM). It consists of a 


prime number of memory modules (521) in order to 
reduce conflicts for the case that the pattern of 


accesses from the processors forms a regular
pattern [26,27]. 

All processors independently compute 
accesses in shared memory, and independently 
access memory. Given that processor no. i is to
access shared memory address A(i), the processor
will compute address-within-module given by

L(i) = A(i) DIV 512

and module number

M(i) = A(i) modulo 521

When the addresses being accessed by the
processors form a vector with constant stride p, the
formula for the A(i) is

A(i) = A(0) + p*i

Here the M(i) fall into 512 different memory 
modules because p and the number of memory 
modules are relatively prime. This is the basis 
for claiming that a prime number of memory 
modules makes certain kinds of accessing 
"conflict-free". 

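The conflict-free claim can be checked numerically. The Python sketch below uses the two formulas from the text (DIV 512 for the address within a module, modulo 521 for the module number); the function names and the comparison against a hypothetical 512-module memory are assumptions added for illustration.

```python
NUM_PROCESSORS = 512

def locate(address, num_modules=521):
    """Return (module number, address within module) for a shared address."""
    return address % num_modules, address // 512

def distinct_modules(base, stride, num_modules=521):
    """Count the different modules touched when processor i accesses
    address base + stride*i, for i = 0 .. 511."""
    return len({(base + stride * i) % num_modules
                for i in range(NUM_PROCESSORS)})
```

For every stride that is not a multiple of 521 the 512 processors land in 512 different modules. With a power-of-two module count such as 512, a stride of 2 would already halve the number of modules in use.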

Features for Fault Tolerance 
Because of the flexibility of the connection 


network, a simple method of providing spare 
processors and memory modules is planned. Each 


CN buffer contains a "replacement unit directory" 
to redirect connections around spare units. 
Single error correction, double error detection 
(SECDED) code covers all memory and transfers 
through the connection network. The connection 
network, being duplexed, has a simplex mode of 
operation as backup. 


Staging Memory 


Staging memory is called "Data Base Memory" 
in (1) where a size of 128 Mwd is assumed. Later 
discussions have centered on a size of 256 Mwd. 
Transfer rates must be on the order of 50 Mwd per 
second to and from shared memory. Access time 
requirements make disk undesirable. If staging 
memory were to be built of semiconductor compon- 
ents, then 256k-bit chips would be desirable. 


The design and control of the staging memory 
has no surprises. The structure is one of a dual 
port memory. One port responds to requests from 
the coordinator for high-speed transfers between 
staging memory and Extended Memory. The other 
port is externally controlled and provides the 
high-speed data path to the rest of the system. 


Connection Network 


The connection network is used like a dial- 
up network, with any processor requesting 
connection to any memory module at any time, with 
the concomitant "message" being an address plus
one word of data either stored to or fetched from 
the memory module involved. All processors could 
request simultaneously. Blockage must be low 
enough that the average added delay due to 
blockage is small compared to the time due to 
cable delays, access time of the memory module 
and memory conflicts. In addition processors 


must be treated "fairly". In the intended 
applications all processors have an equal amount 
of work to do. If any processor had a low 


probability of making its connections through the 
connection network, then that slower processor 
would tend to be the last processor arriving at
the synchronization points, thereby slowing up 
the whole system. 


The chosen configuration (Figure 3) is 
called the "baseline" network by Wu and Feng [28].


We first derived it as an isomorphism to the 


Omega network of Lawrie [29]. A parallel paper 
[22] discusses the design and validation of the 
connection network showing that it indeed 
performs as desired. 


The time it takes to make a connection from 
any one of the 512 processors to any one of the 
521 memory modules is estimated at 120 ns., 
barring conflicts or blockage. The throughput 
analysis of the FMP assumed a path width of Ill 
bits. During throughput analysis of the FMP, a 
particular distribution of shared memory
conflicts and of blockage in the connection 


network was assumed. After the simulations to 
evaluate performance were nearly finished, 
simulation of the connection network [23] showed 
that the assumed delays were in fact correct. 


Synchronization 


Synchronization is mechanized by the WAIT 
instruction. A processor continues to execute 
WAIT until a "go" signal is received. The "go" 
signal is the 512-way AND of a signal emitted by 
each waiting processor. Synchronization ensures 
that no processor tries to fetch new data until 
that data has in fact been produced, perhaps by 
the slowest processor, in the preceding DOALL 
construct. 


Figure 4 shows a mechanism whereby the 
512-input AND gate is implemented as a tree-form 
cascade of 8-input AND gates (Figure 4 is 
actually drawn for a 27-input AND gate 
implemented as a cascade of 3-input AND gates; 
the number of levels in the tree comes out the 
same in either case). The root node of the tree 
reflects the "GO" signal back to all processors 
when the "AND" output is true at the root node. 
Note that the spare processors must always appear 
to be waiting even when being serviced or checked 
off-line from the primary problem. 


The total delay from the last processor 


accessing a WAIT instruction until the "go" 
signal reaches all processors has been estimated 
at 160 ns.
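
The remark that the 512-input tree of 8-input gates and the 27-input tree of 3-input gates have the same number of levels can be verified with a short Python model of the cascade. The function name `and_tree` is illustrative, and the model says nothing about the actual circuit timing.

```python
def and_tree(signals, fan_in):
    """Reduce the WAIT signals with a cascade of fan_in-input AND gates.

    Returns (go, levels), where `go` is the value at the root node and
    `levels` is the number of gate levels in the tree.
    """
    level = list(signals)
    levels = 0
    while len(level) > 1:
        level = [all(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels
```

Both configurations come out at three levels, and a single processor that has not yet reached WAIT holds the "go" signal false.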


Performance Validation 


NASA had supplied two complete three-
dimensional aerodynamic flow codes, solutions of
the time-averaged Navier Stokes equations, and 
some weather codes. Three of these programs were 
completely analyzed. The method of analysis was 
to determine the calling sequence, the path of 
execution through the entire program, with 
notations as to how often each section of the 
code was called. Appropriate DO loops were 
converted into concurrent "DOALL" constructs in 
which DO iterations are converted into DOALL 
instances. Representative sections of the
programs were exercised in simulation to deter-
mine running time. Other sections had their
running time estimated based on how their parameters
were related to the parameters of the simulated
sections. The most significant parameter was the
number of floating point operations per reference
to the shared memory. The running time and
number of floating point operations in each
section are each summed to give the running time
for the whole program and the number of floating
point operations for the whole program. The
quotient of these two totals is then the
throughput for the entire program in terms of
floating point operations per second. Details
are in [1] in Appendix A.

The results of this analysis are summarized
in Table I. In brief, performance met the target 
of 1.0 Gflop/sec for favorable aerodynamic 
applications, and varied from 0.5 Gflop/sec on up 
for other suitable applications. The chemistry 
and radiation portions of the global circulation 
model were not vectorized, but consisted of a 
doall with one instance at each point on the 
globe; the doall program segment having much data 
dependent branching within it. 


Conclusion 


A generalization of vector architectures for 
high-throughput numerical computing has been


presented. The lack of any need to vectorize the 
application should make it more widely applicable 
than are the current generation of vector 
machines. Validation using actual application 


programs supports the expectation of high through- 
put. 


The three programming constructs are the 
parallel execution of many instances of the same 
code, the use of named index sets, and the 
concept of two types of memory, one private to a 


single instance, the other shared across the 
entire program. 


ARRAY MACHINE CONTROL UNITS FOR LOOPS CONTAINING IFs

Abstract -- The control unit is the interface
between the compiler and the processing part of a 
computer. A number of array (parallel or pipe- 
line) machines have been built with scalar or 
array instruction sets. Most such machines do a 
poor job of handling sparse data arrays and this 
paper addresses how such computations may be 
better handled. We emphasize two areas: 


1. Conditional statements can lead to 
Boolean recurrences that must be solved to gener- 
ate control bits. We discuss hardware for the 
solution of Boolean recurrences. 


2. Sparse array computations lead to diffi- 
cult memory access and data alignment problems. 
We discuss an efficient bit string approach to 
handling such computations. 

1. Introduction 

The control unit of any computer is the point 
at which the compiler meets the rest of the com- 
puter system. Thus, a well-designed control unit 
is necessary in achieving good system performance. 
In an array processor, the control unit is rela- 
tively more important because the system is more 
complex. Furthermore, compiler algorithms should 
be designed hand-in-hand with the control unit to 
achieve high system performance for ordinary user 
programs. 


In this paper we discuss a subject that has 
seldom been handled well in existing parallel or 
pipeline machines, namely, the processing of 
sparse arrays in an efficient manner. We will
present ideas that can be applied to array ma- 
chines that execute single-array operations, which 
we denote SEA (single execution, array), and can 
be regarded as simple parallel or pipeline ma-
chines. The ideas are also useful in MEA (mul-
tiple execution, array) machines, which can execute
several array operations simultaneously, and MES
(multiple execution, scalar) machines, which can
be regarded as tightly coupled multiprocessors
whose goal is the speedup of one program at a
time [KuPa79]. For more discussion of the above
notation, see [Kuck78].


Specifically, we will discuss three hardware 
aspects of executing programs: accessing data in 
parallel memory units, alignment networks that 
pair proper array elements, and the processing 
pattern of the elements. In a parallel machine 
the problem is pairing elements in different pro-
cessors, while in a pipeline machine the problem
is pairing elements to be fed into the pipeline. 


We assume that a traditional serial language 
is used to specify array operations, and that 
arrays are stored densely in a parallel set of 
memory units. The problem arises when conditional 
statements in loops cause the selection of only a 
limited, random set of the array operations to be 
performed. We will show that there are simple 
synchronous ways of accessing and aligning such 
arrays that should give high performance in most 
programs. 


Two aspects of programs will be discussed. 
First, IF-statements contained in the scope of 
iteration statements (e.g., DO loops) give rise 
to mode bits that are used to control the execu- 
tion of subsequent statements. We will discuss 
the fast generation of such mode bits, even when 
cycles of dependence are involved. This gives 
rise to new algorithms for the fast execution of 
Boolean recurrences. 


Secondly, we discuss the use of mode bits in 
executing arithmetic array statements. Here the 
problem of accessing sparse arrays in parallel 
memories arises. We will present some theoreti- 
cal results, sketch some hardware and give an 
example of the operation of our ideas. Formerly, 
high degrees of vectorization have been achiev- 
able in these cases, but the sparseness of the 
vectors led to poor efficiency unless the arrays 


were first compressed [Kuck76]. 


We do not discuss the handling of compressed 
arrays. Most serial languages do not have ex- 
plicit ways of specifying compress and expand 
operations; however, they may be useful operations 
when arrays are extremely sparse or indexing pat- 
terns are such that the methods we describe per- 
form poorly. Some languages and software systems 
do allow the specification and manipulation of 
sparse arrays, and these are useful in many appli-
cations. In [Kuck70], this subject was dealt with
for a few special cases and the ideas of this 
paper can be extended to this area as well, but 
are beyond our present scope. 


The remainder of the paper contains five 
sections. In Section 2, some background ideas are 
presented. Section 3 discusses Boolean recur-
rences and Section 4 is about arrays and mode
bits. Section 5 contains a detailed example and 
Section 6 gives some remarks and conclusions. 


2. Theoretical Background 


Here we briefly discuss the theoretical foun- 
dations of our work. For more details, see 
[Bane79]. Earlier results about compilation with 
conditional statements may be found in [Towl76] 
and [Kuck76]. 


Consider an arbitrary program consisting of 
loops, assignment statements, and conditional 
statements. Because of the presence of the condi-
tional statements in the program, some instances
of some of the statements may fail to get executed.
For each assignment statement S, we define a
Boolean valued function $F_S$ of a suitable set of
(loop) index variables, such that
(1) $F_S$ has a value for each instance of S;
and (2) the value of $F_S$ corresponding to a given
instance of S is 1, iff that instance must be exe-
cuted. We call $F_S$ the mode function of S and its


values the mode bits for S. Clearly, the mode 
function of a statement is determined by the con- 
ditions of all the conditional statements whose 
scopes contain that statement. The efficient 
generation of mode bits and their proper use is 
our main concern. 


The statements in the program are dependent 
upon one another in a certain way. Using this 
dependence structure, we can break up the given 
program into a partially ordered set of sub- 
programs. The same results would be obtained, if 
instead of executing the given program serially we 
execute the subprograms in any parallel way, as 
long as a subprogram is never started until all of 
its predecessors have finished. 


No subprogram can be further decomposed along 
similar lines. Moreover, these subprograms can be 
grouped into several classes according to their 
characteristics, among which are the classes of
cyclic mixed subprograms, acyclic subprograms,
and cyclic arithmetic subprograms.


A cyclic mixed subprogram is such that some 
of the variables defining the conditions of its


conditional statements are evaluated within the 
subprogram itself. This leads to the design of 
programmable hardware for the solution of Boolean 
recurrences (Section 3). A Boolean recurrence
B<n,m> of degree n and order m ($1 \le m \le n$) is a
set of equations of the form

$$x_k = \phi_k(x_{k-1}, x_{k-2}, \ldots, x_{k-m}) \qquad (1 \le k \le n)$$

where $x_1, \ldots, x_n$ are Boolean variables and
$x_0, x_{-1}, \ldots, x_{-m+1}$ Boolean constants.

Consider now a subprogram where all variables
defining the conditions of all the conditional 
statements are computed outside the subprogram. 

If in addition there is exactly one assignment 
statement, all of whose instances can be executed 
independently of one another, then we have an 
acyclic subprogram. Thus the mode bits for the 
unique assignment statement are known at execution 
time. This leads to the use of mode bits to con- 
trol the execution of array assignment statements, 
involving the accessing of memory and aligning of 
data to and from memory. We will see in Section 4 
that hardware for this can easily be added to 
standard indexing hardware, and this extends the 
earlier work on conflict-free array access 
[BuKu71], [Lawr75]. 


A cyclic arithmetic subprogram has one or 
more arithmetic assignment statements which are 


dependent upon one another or on themselves; 
except for that, it is similar to an acyclic sub- 
program. Here also the mode bits are known at 
execution time, but the instances of the assign- 
ment statements can no longer be executed inde- 
pendently. A subprogram of this kind is equiva-
lent to an arithmetic recurrence with mode bits.
A comment is made on the solution of linear arith- 
metic recurrences with mode bits in the final 
section; we do not discuss this problem in detail. 
(The definition of an arithmetic recurrence is 
obtained from that of a Boolean recurrence given 
above by making the obvious changes. An arithme-
tic recurrence is linear, if each $\phi_k$ is a linear
function of its arguments.)


3. Generation of Mode Bits 


3.1 Cyclic Mixed Subprograms 
Consider the following example. 


     DO k = 1, 100, 1
        IF [C(k) > U(k) + C(k-1)]
        THEN BEGIN
           S1:  C(k+1) = V(k+1)
           S2:  Y(k) = C(k+2) + Y(k-1)
        END
        ELSE BEGIN
           S3:  C(k+1) = W(k+1)
        END

Let $x_k$ denote the condition of the IF statement.
Then the Boolean variables $x_1, x_2, \ldots, x_{100}$
satisfy the Boolean recurrence B<100,2> described
below.

$$x_k = a_{k0}\,\bar{x}_{k-1}\bar{x}_{k-2} + a_{k1}\,\bar{x}_{k-1}x_{k-2} + a_{k2}\,x_{k-1}\bar{x}_{k-2} + a_{k3}\,x_{k-1}x_{k-2} \qquad (1 \le k \le 100)$$

where

$$x_{-1} = 0, \quad x_0 = 0;$$
$$a_{10} = [C(1) > U(1) + C(0)], \quad a_{11} = a_{12} = a_{13} = 0;$$
$$a_{20} = [W(2) > U(2) + C(1)], \quad a_{21} = 0,$$
$$a_{22} = [V(2) > U(2) + C(1)], \quad a_{23} = 0;$$
$$a_{k0} = [W(k) > U(k) + W(k-1)]$$
$$a_{k1} = [W(k) > U(k) + V(k-1)]$$
$$a_{k2} = [V(k) > U(k) + W(k-1)]$$
$$a_{k3} = [V(k) > U(k) + V(k-1)]$$
$$(k = 3, 4, \ldots, 100)$$

(Here [...] represents a Boolean valued expression.)
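
As a check on how the recurrence relates to the loop, the Python sketch below computes the IF conditions twice: once by running the loop serially, and once from coefficients that depend only on values known before the loop starts. Statement S2 is omitted because Y does not enter the condition. The function names are hypothetical, the k = 2 coefficients are derived from the loop itself, and x_{k-1} is taken as the high-order bit of the minterm number.

```python
def serial_mode_bits(C, U, V, W, n):
    """Run the loop as written and record the IF condition for k = 1..n."""
    C = list(C)                      # work on a copy; S1 and S3 update C
    bits = []
    for k in range(1, n + 1):
        xk = C[k] > U[k] + C[k - 1]
        bits.append(xk)
        C[k + 1] = V[k + 1] if xk else W[k + 1]   # S1 if true, S3 if false
    return bits

def coefficients(C, U, V, W, k):
    """Coefficients a_k0..a_k3, using only values known before the loop."""
    if k == 1:
        return [C[1] > U[1] + C[0], False, False, False]
    if k == 2:
        return [W[2] > U[2] + C[1], False, V[2] > U[2] + C[1], False]
    return [W[k] > U[k] + W[k - 1], W[k] > U[k] + V[k - 1],
            V[k] > U[k] + W[k - 1], V[k] > U[k] + V[k - 1]]

def recurrence_mode_bits(C, U, V, W, n):
    """Evaluate x_k = a_kt with t = 2*x_{k-1} + x_{k-2}, from x_0 = x_{-1} = 0."""
    prev1, prev2 = False, False
    bits = []
    for k in range(1, n + 1):
        a = coefficients(C, U, V, W, k)
        xk = a[2 * int(prev1) + int(prev2)]   # select the active minterm
        bits.append(xk)
        prev1, prev2 = xk, prev1
    return bits
```

The arrays are indexed from 0 to n+1. All coefficients can be formed independently of one another, which is what allows them to be computed in parallel before the recurrence is solved.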


The Boolean coefficients $a_{kt}$ ($1 \le k \le 100$,
$0 \le t \le 3$) of this recurrence can be computed in
parallel on a vector machine like one shown in
Fig. 1. They are all stored in the Boolean- 
coefficient memory. After n sets of coefficients 
are stored, the Boolean-recurrence solver gener- 
ates mode-function bits for n iterations of the 
loop. Those bits are stored in the mode-function 
register and control the parallel execution of the 
true and false branches of the conditional state-
ment in the loop. In our example, the statements
S1 and S2 are executed in each processor that has
a mode-function bit equal to 1. Processors with
mode-function bit equal to 0 are turned off.
After S1 and S2 have been executed, the content
of the mode-function register is complemented and
statement S3 is executed.

If the upper limit of index k is much larger 
than the number of processors n, the execution of 
the loop can be partitioned into n-iteration 
slices. In this case, the computation of mode- 
functions by solving Boolean recurrence can be 
overlapped (pipelined) with the computation of 
Boolean coefficients and execution of the IF 
statement. This way IF statement control becomes 
time-transparent to the original vector machine. 
Thus, we must be able to solve such Boolean re- 
currences. 


We now consider a general cyclic mixed subprogram. From this program we extract a Boolean recurrence. To evaluate the $k$th variable $x_k$ of this recurrence, we need to know certain values computed inside the subprogram itself and certain values coming from outside. The values (arithmetic and Boolean) coming from outside will be completely known at run-time, and they are to be treated as constants. The formula defining $x_k$ can be expressed in terms of the constants in at most $2^{k-1}$ different ways, but frequently requires much less than that, since several different paths may lead to the same expression and hence can be combined.

Control hardware for the loops with IF statement

3. Solution of Boolean Recurrences

In what follows, n and m denote two integers such that $1 \le m \le n$. Consider an arbitrary set of m Boolean variables $\{y_1, y_2, \ldots, y_m\}$. The $2^m$ minterms of these variables are numbered $0, 1, 2, \ldots, 2^m - 1$ in the usual way, and the $j$th minterm is denoted by $P_j(y_1, y_2, \ldots, y_m)$. We will use AND and OR gates, such that each gate has a gate delay of one unit of time. It is assumed that each gate gives true and complemented outputs with no time or cost penalty. The sole purpose of this assumption (which holds for ECL circuit family gates) is to keep our formulas simple; extension to the general case is easy and straightforward. For any positive integer k, we write log k to denote $\lceil \log_2 k \rceil$.

Let us define a Super Cell (SC) (Fig. 2(b)) to be a piece of combinational logic which takes $(m+1)2^m$ inputs $\{a_s \mid 0 \le s \le 2^m - 1\} \cup \{b_{rt} \mid 1 \le r \le m,\ 0 \le t \le 2^m - 1\}$ and produces $2^m$ outputs $c_t$ defined by

$$c_t = \sum_{s=0}^{2^m-1} a_s\, P_s(b_{1t}, b_{2t}, \ldots, b_{mt}) \qquad (0 \le t \le 2^m - 1)$$

where each $c_t$ is realized by the logic in a Basic Cell (BC) (Fig. 2(a)). The following lemma is obvious.

Lemma 1

If fan-in and fan-out considerations are ignored, then an SC can be implemented in 2 gate delays with $2^m(2^m + 1)$ gates.

Consider now a general Boolean recurrence B<n,m> of degree n and order m, defined by the equations

$$x_k = \sum_{j=0}^{2^m-1} a_{kj}\, P_j(x_{k-1}, x_{k-2}, \ldots, x_{k-m}) \qquad (1 \le k \le n)$$

where the a's and $x_0, x_{-1}, \ldots, x_{-m+1}$ are known Boolean constants.

Theorem 1

If fan-in and fan-out considerations are ignored, then the Boolean recurrence B<n,m> can be solved in $2(\log n + 1)$ gate delays with

$$\left[\left(\tfrac{n}{2}\log n\right) 2^m (2^m + 1) + n(2^m + 1)\right] \text{ gates.}$$

For a proof of this theorem and for results in the limited fan-in, fan-out case, see [Bane79].

As an example, the solution of B<8,2> is shown in Fig. 2(c).

Fig. 2. (a) Basic Cell (BC) for m = 2. (b) Super Cell (SC) for m = 2. (c) Solution of Boolean recurrence B<8,2>.
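
As a reference point for the definition of a Boolean recurrence B<n,m>, the following Python sketch evaluates one serially. It is not the logarithmic-depth network of Theorem 1, only a direct reading of the defining equation. The names `minterm_index` and `solve_serially` are hypothetical, and the minterm numbering assumes that the first argument is the most significant bit, which is one common convention and is not confirmed by the text.

```python
def minterm_index(ys):
    """Number of the single minterm that is true for the bit tuple ys.

    Assumption: the first variable is the most significant bit.
    """
    index = 0
    for y in ys:
        index = (index << 1) | (y & 1)
    return index


def solve_serially(coeffs, initial, m):
    """Evaluate x_1..x_n of B<n,m> one term at a time.

    coeffs[k-1][j] holds a_{kj}; initial = [x_0, x_{-1}, ..., x_{-m+1}].
    """
    assert len(initial) == m
    window = list(initial)            # (x_{k-1}, x_{k-2}, ..., x_{k-m})
    xs = []
    for row in coeffs:
        assert len(row) == 2 ** m
        # Exactly one minterm P_j equals 1, so the OR over j of
        # (a_{kj} AND P_j) reduces to the coefficient of that minterm.
        x_k = row[minterm_index(window)]
        xs.append(x_k)
        window = [x_k] + window[:-1]
    return xs


# B<4,1> with x_k = NOT x_{k-1}: a_{k0} = 1 (chosen when x_{k-1} = 0), a_{k1} = 0.
toggle = solve_serially([[1, 0]] * 4, [0], m=1)

# B<4,2> with x_k = x_{k-1} XOR x_{k-2}, starting from x_0 = 1, x_{-1} = 0.
xor_chain = solve_serially([[0, 1, 1, 0]] * 4, [1, 0], m=2)
```

The serial version needs n steps; the point of Theorem 1 is that the same values can be produced in a number of gate delays that grows only with log n.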
4. Arrays and Mode Bits 


Assuming that mode bits exist, we now discuss 
their use in memory accessing for arrays. We will 
discuss alignment later. Its implementation is 
straightforward with a crossbar switch but less 
costly with an extended omega network [Lawr75]. 


Assume a storage scheme such that the element X(I) of an array X is stored in memory module number f(I), given by

$$f(I) = (I + \mathrm{Base}_X) \bmod m$$

where $\mathrm{Base}_X$ is the number of the module that contains X(1) and m is the total number of memory modules. The following two results are crucial for our discussion; for proofs, see [Bane79]. (The notation of this section is somewhat different from that of the previous sections.)

Lemma 2

Let $a_0$, $a$ denote integers such that gcd(a, m) = 1. Then the elements $X(a_0 + ai)$ and $X(a_0 + aj)$ of an array X are stored in the same memory module iff (j - i) is a multiple of m.

An immediate consequence of this lemma is the following corollary.

Corollary 1

Let $a_0$, $a$, $n$ denote integers such that gcd(a, m) = 1 and $0 < n \le m$. Then for any i, the set of elements $\{X(a_0 + aI) \mid i \le I \le i + n - 1\}$ of an array X can be accessed from memory without any conflicts.
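
Lemma 2 and Corollary 1 are easy to check numerically. The sketch below assumes the storage function $f(I) = (I + \mathrm{Base}_X) \bmod m$ given above; the function names and the sample constants are illustrative only.

```python
def module_of(index, base, m):
    """Memory module holding X(index) under f(I) = (I + Base_X) mod m."""
    return (index + base) % m


def is_conflict_free(a0, a, first, n, m, base=0):
    """True if X(a0 + a*I), first <= I <= first + n - 1, fall in n distinct modules."""
    modules = [module_of(a0 + a * i, base, m) for i in range(first, first + n)]
    return len(set(modules)) == n


m = 5
ok = is_conflict_free(a0=1, a=2, first=1, n=5, m=m)      # gcd(2, 5) = 1
clash = is_conflict_free(a0=3, a=10, first=1, n=5, m=m)  # gcd(10, 5) = 5
```

With a stride that is a multiple of m, every element lands in the same module, which is the failure case the text describes for gcd(a, m) > 1 when m is prime.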


Consider now the program

    DO I = 1, u, 1
    S   Z(c_0 + cI) = X(a_0 + aI) op Y(b_0 + bI)
    END

where X, Y, Z are one-dimensional arrays, $u$, $c_0$, $c$, $a_0$, $a$, $b_0$, $b$ are integer constants, and op some valid operation. (The conditional statements are not shown explicitly; we deal with the mode function of S instead.) Let us assume that m is a prime number and that none of a, b, c is a multiple of m. If the value of the mode function $F_S$ of statement S is equal to 1 for each value of I, then everything works just fine. We can fetch $X(a_0 + aI)$ and $Y(b_0 + bI)$ and store the result of op in $Z(c_0 + cI)$ for any m consecutive values of I, without ever getting a conflict. However, in general, $F_S(I) = 1$ only for a random set of values of I. And only those instances of statement S are to be executed for which $F_S(I) = 1$. We may still fetch the full set $\{X(a_0 + aI) \mid 1 \le I \le m\}$ without any conflicts, but now probably only a few of these values need to be sent to the processors.


The above lemma points to a way of avoiding this potential inefficiency. We look at a number of values of $F_S(I)$, much larger than m. A set of m or fewer 1's are selected from these values, such that if $F_S(i)$ and $F_S(j)$ are in this set and $i \ne j$, then (j - i) is not a multiple of m. The values of the index I corresponding to these bits are guaranteed not to produce any conflicts in the memory addresses of the elements $X(a_0 + aI)$ of any arbitrary array X, as long as gcd(a, m) = 1. Our scheme fails when gcd(a, m) > 1, but then nothing can be done in that case; $X(a_0 + aI)$ will lie in the same memory module independently of I. If m is a large prime number, these instances of failure should occur very rarely.


The selection of m or fewer mode-bits with value 1 is accomplished by the Mode-Function Compressor (MFC) shown in Fig. 3. The MFC has two outputs: mode bits and their corresponding indices, and it can be thought of as consisting of m content-addressable memories, each storing pairs of the form $(F_S(i), i)$. Any two pairs $(F_S(i), i)$ and $(F_S(j), j)$ in one memory have (mod-m)-equivalent index values; that is, $i \equiv j \pmod m$. Each memory, when enabled, reads out the first value $(F_S(i), i)$ with $F_S(i) = 1$, or the pair $(F_S(i) = 0, i = 0)$ is issued when there is no pair with $F_S(i) = 1$. Mode bits are stored in the Mode-Function Register as before. $F_S(i) = 0$ will turn off the corresponding processor, which will generate a null result that is never stored in any module of the Parallel Memory. The corresponding index values are sent to the memory address generator, which generates the memory address for each memory unit from the common vector descriptor containing $a_0$, $a$, and $\mathrm{Base}_X$ for each vector $X(a_0 + aI)$.


The set of m associative memories may become 
prohibitively costly for reasonable values of m 
(16 to 64) and index set I (1024 to 4096). For 
this reason, we will now describe a less costly 
but slower design of the MFC (Fig. 4). 


Part of this design is similar to a paging system. Suppose that the set of all values of the mode function $F_S$ (in the Boolean Coefficient Memory) has been broken up into a number of "pages," each page being m bits long. Page 1 consists of the values $\{F_S(1), F_S(2), \ldots, F_S(m)\}$, page 2 consists of $\{F_S(m+1), F_S(m+2), \ldots, F_S(2m)\}$, and so on. There are L "page frames," where L is some suitably chosen number, and a frame is an m-bit register. The "page table" consists of L lines, where line $\ell$ gives the number of the page residing in frame $\ell$, and a test bit which is 1 iff frame $\ell$ has at least one 1 ($1 \le \ell \le L$). A page is brought to the frames iff it has at least one 1. (We assume that the sum of all the bits in a given page is also stored in memory, and that this bit is tested before the page is brought out.) No page is brought to the frames more than once. Any frame can hold any page. When the time comes to bring new pages into frames, a frame is refilled iff all the 1's of the page originally residing in this frame have been used up (as indicated by its test bit). We will see that the 1's in frame 1 are always used up before refill-time, and hence its test bit should be permanently fixed at 0.


We start by bringing L pages into the L frames. Then we choose the leading 1 in each of the m columns, i.e., the leading 1 among the 1st bits of all frames, the leading 1 among the 2nd bits of all frames, and so on. The values of the index I corresponding to these bits lead to executable instances of statement S, and they do not cause memory conflicts. If the position of the leading 1 in the $k$th column is $\ell$, then the value $i$ of index I corresponding to this bit is given by

$$i = (\text{number of page in frame } \ell - 1)\,m + k \qquad (1 \le k \le m,\ 1 \le \ell \le L,\ 1 \le i \le u).$$

If the $k$th column has at least one 1, then the index value i corresponding to the leading 1 goes to the Memory Address Generator.

Before the process is repeated, we must reset the leading 1-bit in each column and update the test bit for each frame. New pages are brought into the frames whose test bits are equal to 0. And we start all over again. If the loop-size is large and the steady state is reached, we should be able to get out m (or close to m) conflict-free index values from the MFC, for a number of times.

5. Example

In this section we present an example of handling sparse array operations using the method of the previous section. As was mentioned earlier, the idea can be used for a register-to-register pipelined processor as well as for a parallel machine as sketched here.

Consider the program of Fig. 5(a), a segment of a larger program, in which the X array is tested and C(I) is updated whenever X(I) is non-negative. Fig. 5(b) shows those index values (I) for which this test is true. Given a memory system with five memory units, conflict-free access to array elements is guaranteed except for such subscripts as 5I, 10I + 3, etc. Such a memory is shown in Fig. 6.

A snapshot of the system in Fig. 3 is shown in Fig. 7. It is assumed that the entire mode function has been computed and stored in mode-bit registers inside the MFC. The contents of the mode-bit registers are shown in the first row in Fig. 7.

    DO I = 1, 15
      IF (X(I) ≥ 0)
        THEN C(I) = A(2I + 1) + B(I + 3)
    END

(a) The program segment

    1, 3, 4, 7, 8, 9, 11, 12, 13, 15

(b) Values of I for which X(I) ≥ 0

Fig. 5. Program with IF in loop

Fig. 6. Memory units with stored arrays

We now have the problem of accessing only those elements of arrays for which the mode bits are 1. The 15 bits (one per loop iteration) are folded over in rows of length 5 (one column per memory unit). After one leading one's detection in each column, mode bits corresponding to index values 1, 7, 3, 4, and 15 are selected and they appear at the outputs of the MFC. At the same time the MFC outputs five 1's to the Mode-Function Register. The array elements A(3), A(15), A(7), A(9), and A(31) correspond to index values of 1, 7, 3, 4, and 15.


Using the code ○ for this first set of elements and referring to Fig. 6, we see that all ○ elements in the A array can be fetched without conflict. Similar statements can be made about accessing the ○ elements in the B and C arrays. The second cycle in Fig. 7 shows the mode bit registers after the first set of 1's are deleted and the results of a second leading one's detection are presented with the □ code: the elements are also marked in Fig. 6. On a third cycle, only one element, marked △, would be accessed. Note that five elements are accessed on the first cycle, four on the second and one on the third. The effective memory bandwidth will always drop off toward the end of a vector access, but will remain high on earlier cycles as long as the addresses are uniformly distributed across the memory units.
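
The folding and leading-one detection of this example can be replayed with a short Python sketch. It models the mode bits as one flat list rather than as page frames, so it corresponds to the case in which all pages are resident; `mfc_cycles` is a hypothetical name, and the set of true indices is the one used in the example (1, 3, 4, 7, 8, 9, 11, 12, 13, 15).

```python
def mfc_cycles(mode_bits, m):
    """Return, per cycle, the index chosen for each of the m columns (or None).

    mode_bits[i - 1] is the mode bit for loop index I = i.
    """
    remaining = list(mode_bits)
    cycles = []
    while any(remaining):
        selected = []
        for col in range(m):
            # Leading-one detection down one column: I = col+1, col+1+m, ...
            hit = next((i for i in range(col, len(remaining), m) if remaining[i]), None)
            if hit is None:
                selected.append(None)       # no 1 left in this column
            else:
                remaining[hit] = 0          # reset the leading 1 just used
                selected.append(hit + 1)    # back to a 1-based index value
        cycles.append(selected)
    return cycles


true_indices = {1, 3, 4, 7, 8, 9, 11, 12, 13, 15}
bits = [1 if i in true_indices else 0 for i in range(1, 16)]
cycles = mfc_cycles(bits, m=5)
first_cycle_A = [2 * i + 1 for i in cycles[0]]   # subscripts of A(2I + 1)
```

The first cycle selects the indices 1, 7, 3, 4, 15 and hence A(3), A(15), A(7), A(9), A(31), as in the text; the second cycle yields four indices and the third only one.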


Fig. 7. A snapshot for example in Fig. 5. For each cycle the figure lists the mode-bit registers, the Mode-Function Compressor outputs (index values and mode bits), and the Memory Address Generator, Parallel Memory, Alignment Network, and Parallel Processor outputs for the A, B, and C arrays. First cycle: all memory elements read or written in this cycle are denoted by ○ in Fig. 6. Second cycle: all memory elements used in this cycle are denoted by □ in Fig. 6. Third cycle: all memory elements used in this cycle are denoted by △ in Fig. 6.

Next, consider the processing of data for this program using the five processor parallel machine of Fig. 3. The Memory Address Generator calculates the proper addresses of array elements from the index values supplied by the MFC and the vector descriptor supplied by the control unit. For example, an array indexed as $A(a_0 + ai)$ has $a_0$, $a$, $\mathrm{Base}_{Unit}$ and $\mathrm{Base}_{Addr}$ included in the vector descriptor, where $\mathrm{Base}_{Unit}$ and $\mathrm{Base}_{Addr}$ are the memory unit number and address of A(1). For each index value i the address

$$\left(\left\lceil (a_0 + ai + \mathrm{Base}_{Unit} - 1)/m \right\rceil - 1\right) + \mathrm{Base}_{Addr}$$

is supplied to memory unit

$$1 + (a_0 + ai + \mathrm{Base}_{Unit} - 2) \bmod m.$$

For details, see [LaVo80]. We see that the array elements from the A and B arrays are not paired properly for processing.

This leads us to our final point, consideration of data alignment between memory units and processors. It is obvious that if a crossbar switch is provided between processors and memories, then the proper alignments would be possible. Instead of an $O(n^2)$ gate switch between n memory and processor units, however, we can employ an $O(n \log n)$ gate omega network [Lawr75], because only uniform shifts and squeezes are involved. Thus, an array indexed as $A(a_0 + ai)$ can be aligned with an array indexed as B(i) by a shift of $a_0$ and a squeeze of $a$. Additionally, a shift by the difference in their base memory unit number is required.


These ideas can be clarified in our example. 
Notice that the A array is stored beginning in 
unit 1, whereas the B array begins in unit 3. 
Thus, B must be left-rotated by distance 2 because 
of its base address, plus 3 because of its sub- 
script (I + 3), for a total of 5, which is pre- 
cisely the number of memory units. A rotation of 
distance 5 (mod 5) is no rotation at all. 


The A array, on the other hand, requires a left rotation of 1 (mod 5) because of its subscript (but none due to its base address in memory 
unit 1) and a squeeze of distance 2 (mod 5) be- 
cause of its subscript. This combination is pre- 
cisely that between input and output of Alignment 
Network I. Since pair elements are correctly ac- 
cessed by the scheme described earlier, they are 
correctly aligned using methods for dense arrays. 
More discussion of dense arrays and omega net- 
works can be found in [Lawr75]. It may be ob- 
served that the scheme we are using for sparse 
arrays can be regarded as substituting for one 
element of a dense array, another desired element 
that happens to be stored in the same memory 
unit. For example, A(15) is substituted for A(5), 
and A(31) is substituted for A(11). Thus, the 
omega network handles the alignments properly. 
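
The unit and address rule of the Memory Address Generator can be checked against this example with a small Python sketch. The function name is hypothetical, memory units are numbered 1 to m as in the text, and the base addresses are set to 0 purely for illustration since only the unit numbers matter for alignment.

```python
def unit_and_address(element, base_unit, base_addr, m):
    """Memory unit (1..m) and address of array element number a0 + a*i."""
    offset = element + base_unit - 1
    address = -(-offset // m) - 1 + base_addr   # ceil(offset / m) - 1 + Base_Addr
    unit = 1 + (offset - 1) % m                 # 1 + (a0 + a*i + Base_Unit - 2) mod m
    return unit, address


m = 5
first_cycle = [1, 7, 3, 4, 15]                  # index values selected in the first cycle

# A(2I + 1) begins in unit 1; B(I + 3) begins in unit 3.
a_units = [unit_and_address(2 * i + 1, 1, 0, m)[0] for i in first_cycle]
b_units = [unit_and_address(i + 3, 3, 0, m)[0] for i in first_cycle]
```

Under these assumptions the B elements come out of units 1 through 5 in processor order, matching the remark that a rotation of distance 5 (mod 5) is no rotation at all, while the A elements come out of five distinct units in a permuted order that the alignment network must undo.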


6. Remarks and Conclusion 


The solution of linear arithmetic recurrences 
with mode bits will be studied somewhere else. 
Here it would suffice to make a few comments. 
Since linear arithmetic recurrences of low order 
can be processed in time proportional to the log 
of the serial time, breaking a recurrence into 
two parts to be processed consecutively could 
actually slow down a computation. In certain 
cases, however, breaking up a large recurrence is 
quite profitable. If a very large number of small 
recurrences arise, an MES machine (or an MEA ma- 
chine with very many instruction sequences) could 
execute each one serially (or using limited pro- 
cessor algorithms [ChKS78]). For register-to- 
register pipeline processors with vector registers 
(e.g., CRAY-1), register-contained recurrences are 
desirable since no memory access is needed other 
than at the beginning and end. Also, on any ma- 
chine, remote term recurrences can be speeded up 
by only computing the final sequence required to 
obtain the remote terms. 


To illustrate the basic idea, consider an 
R<n,l> recurrence defined by 


$$x_i = a_i + b_i x_{i-1} \qquad (1 \le i \le n)$$


with appropriate initial conditions. Suppose this appears in a loop with an IF statement, so a mode bit pattern controls its execution. If one mode bit is zero, then this may be computed as two independent recurrences, using an initial value for x in the zero mode bit position. Similarly, if some $a_i$ happens to be zero, the recurrence can be broken into two recurrences.

A new scheme for handling array operations inside DO loops with IF statements has been presented in this paper. The idea of the Mode Function Compressor can be easily extended to processing of any type of sparse arrays on a parallel machine. We also gave a new result on solving Boolean recurrences.
ABSTRACT

A word-parallel, bit-serial associative processor built around an array of 1-bit wide microprocessors is introduced. It is intended as a low-cost auxiliary processor in small scale computer systems. Data are organized in an array of fixed number of elements, variable word-length vectors. Processing proceeds in parallel on all elements of a vector. Information about the location and word-length of these vectors is stored in a small general-purpose computer which is used to control the storage and processing array.

I. INTRODUCTION

The parallel processing capabilities of an associative processor are highly attractive in many non-numeric applications. Operations such as searching and sorting are inherently parallel in nature, since they may be regarded as a sequence of basic operations such as compare, shift, and mark performed in parallel on a large number of operands. Many organizations have been proposed for associative processors [8, 10]. Of these, the word-parallel, bit-serial, or vertical [9], organization has received considerable attention. This is due to the fact that the bit-serial organization leads to a considerable simplification of the hardware in comparison with fully parallel schemes.

Because of the hardware intensive nature of associative processors, they tend to be economically viable only in large, high capital cost systems. The purpose of this paper is to introduce an associative processor that is meant for relatively small applications. It is based on an array of commercially available 1-bit wide microprocessors. Machine organization is word-parallel, bit-serial. Data is stored and processed in the form of vectors consisting of a fixed number of elements. The machine has been dubbed VASTOR for Vector Associative Store TORonto.

VASTOR is intended as a special purpose processor to be attached to a conventional mini-computer system. In what follows, the minicomputer will be referred to as the host. In such a system, VASTOR would handle those parts of the work load that can benefit from its associative and vector capabilities. Use of associative processors in this manner has been suggested by many authors, e.g. [5]. Also many potential applications have been studied [3]. The main feature of VASTOR is that it represents an associative structure and its implementation that are economically viable in a minicomputer system environment. A prototype processor has been constructed and tested.

The main constraints in the design of VASTOR were low cost and modularity. This required that readily available components be used, that internal communication and control be kept simple, and that VASTOR should not overload the computer to which it is attached. Modularity also meant that backplane interconnections between modules should be kept simple and easily expandable.

The VASTOR processor, figure 1, consists of two main components, namely the processing array and the controller. The processing array contains all the storage and processing elements of VASTOR. The controller translates high level commands received from a scalar machine (the host) into sequences of control signals for the processing array. This paper presents a practical implementation of the array and its controller, and describes input/output transfers between the array and the host computer. Algorithms that may be implemented on vector oriented machines such as VASTOR are readily found in the literature [2, 3 and 7].


II. MACHINE STRUCTURE 


The organization of the VASTOR array is il- 
lustrated in figures 2 and 3. The storage sec- 
tion in the array is an n-word memory, with a 
word length of several kilobits. Operations are 
performed on vectors of data elements, figure 2, 
when the elements of a given vector occupy the 
same bit positions in all words. While the num- 
ber of bits per element is the same for all ele- 
ments of a given vector, it may vary from one 
vector to another. A 1-bit wide processing element PE is a part of every word. Shift-register SH provides the main mechanism for data transfer among VASTOR words, as well as between the array and the outside world. 

VASTOR’s architecture, depicted in figures 2 and 3, has the properties both of an associative processor and of an array processor, in the sense in which those terms are defined in [10]. It is an SIMD machine, as are both of these types (note that opcode lines are shared by all cells in figure 2). Each cell contains a storage element which may be used to mark individual words. The I/O structure enables the host to read from and write to marked words in the memory. This allows VASTOR to be used as a content-addressable memory for the host machine. Each cell also has the ability to perform logical and arithmetic operations on its memory under the control of the mark bit, so that one may operate (in parallel) on all data elements satisfying some arbitrary condition. The above features give VASTOR the properties of an associative processor. 

On the other hand, one may leave all words 
selected and use VASTOR as an array of processors. 
Its I/O structure allows large quantities 
of data to be transferred to and from the host 
machine via the parallel port on the right of 
figure 2. I/O data transfer rate ranges from 
0.5 to 8 Mbit/s, as will be discussed in section 
V. Each cell C can perform data manipulation 
operations on one word of the memory M. From 
this point of view, VASTOR is an array proces- 
sor. Inter-processor communication within the 
array enables handling of data organized in the 
form of a one-dimensional array, hence the word 
"vector" in the machine’s name. Thus associa- 
tive operations may be seen as a particular case 
of array processing, in which a preliminary com- 
putation is used to select data in certain cells 
for further processing or output to the host ma- 
chine. 

VASTOR operations are essentially word-par- 
allel, bit-serial. The major differences bet- 
ween VASTOR and other serial machines, e.g. 
STARAN [10], stem from pragmatic considerations: 
component cost and backplane complexity. 
STARAN’s memory is multi-dimensional: data may 
be accessed either by row (horizontally) or co- 
lumn (vertically) of a 256 row by 256 column me- 
mory array. These two modes of access involve a 
relatively complex interconnection network, 
which is referred to as a "flip network". Such 
a network is not required in VASTOR. 

VASTOR uses 256 conventional 1024 by 1 bit 
random-access memories, all driven by the same 
address lines (cf. figure 2). Operations can be 
performed only on columns of memory. Because of 
this it is a "vertical" computer similar to that 
proposed by Shooman [9]. The I/O structure has 
been designed to compensate for the resulting 
difficulty in communicating with the "horizon- 
tal" host machine. 

When the number of elements in a data vector is greater than the number of cells in a column of memory, operations can be carried out on "sub-vectors" of 256 elements each. This compromise exists in Shooman’s machine also. 

As mentioned earlier, development of the structure of VASTOR has been heavily influenced by interconnection considerations. The array has been designed to use only "daisy-chained" and "bused" connections between circuit boards. This allows new boards to be added at any time to increase the size of the array with minimal modifications to the existing backplane. The structure is also well suited to large-scale integration because of the small number of interconnections required between modules. 

The main implication of the above restric- 
tion on backplane complexity is that it limits 
the inter-word and associative facilities that 
may be used. Hence, inter-word communication is 
accomplished via a shift-register, which in- 
volves a daisy-chain connection between circuit 
boards for both data and control information. 
Moreover, a single bused connection common to 
all words of the array combined with an analogue 
to digital converter (not shown) are used to 
provide limited accuracy associative testing. 

The structure of VASTOR may be discussed in 
terms of three separate features: the intra- 
word storage and computation, the inter-word 
communication, and the associative testing capa- 
bilities. Each of these features is discussed 
briefly below. 


2.1 INTRA-WORD FACILITIES 


Figure 4 shows the components of a VASTOR 
word: two kinds of storage, a 1-bit processor 
and one bit of a shift register. 

The random-access memory referred to in the 
figure as WK constitutes the ‘working store’. 
Data are taken from this memory and returned to 
it during computation. A second memory, referred 
to as BK, for backing store, is a serial memory. 
Its contents are swapped with the contents 
of the working store in pages containing 256 
bits per word. One more bit of storage is 
available for each word in its part of the 
shift-register SH. This may be used for tempo- 
rary storage of operands. It should be noted 
that the intra-word facilities can be expanded 
through the use of the line marked “B” on the 
figure. 

The 1-bit processing element PE with which 
VASTOR has been implemented is the Industrial 
Control Unit - Motorola MC14500B. It performs a 
limited set of primitive operations on external 
data and a 1-bit internal accumulator called RR 
(the result register). Another internal regis- 
ter, output enable or OEN, contains a mask which 
is used to enable selective write-back into 
either the working or the backing store. The 
collection of the OEN registers in all words 
constitutes the output enable vector. 


2.2 INTER-WORD COMMUNICATION 


The shifter SH is the primary medium for 
inter-word communication. It is the only ma- 
chine feature that defines any order to the 
words. The shift-register SH is divided into 
8-bit segments as shown in figure 5. Each seg- 
ment of SH has two parallel bidirectional ports 
A and B. The B port is connected to one "phrase" of eight VASTOR words. The A ports of all segments are connected together to form an 8-bit I/O bus. 
Two multiplexers CIRC and SHMODE connect the serial inputs of the segments of SH to any of a number of sources. This allows data transfer between the shifter and VASTOR words to take place in one of the following modes.

1. VASTOR to shifter - parallel mode through the B port: in this mode the source of data may be the processing element PE, the working store WK or the backing store BK.

2. VASTOR to shifter - serial mode through the SI1 port: in this mode up to eight bits of data may be loaded from any word of a phrase into the shifter segment. This operation takes place in parallel for all phrases.

3. Shifter to VASTOR - parallel mode: VASTOR words may be loaded in parallel from port B of the shifter SH via the processing element PE.

4. Shifter to VASTOR - serial mode: 8 bits of data can be moved serially from a shifter segment to any word in the corresponding phrase. This is accomplished via the combined use of the output enable vector OEN and the ability to circulate data within each of the 8-bit segments of SH.


We should note that in the two serial modes 2 and 4, only one word of each phrase is involved in data transfer. This reduces the parallelism in the array by a factor of eight. However, the serial modes are necessary to simplify byte-oriented data transfer between VASTOR and the host machine, as will be discussed in section V. 


2.3 ASSOCIATIVE TESTS 

All VASTOR operations may leave a result in 
register RR of the processing element. Contri- 
butions from all RR registers are summed, in an 
analogue fashion, onto a single line. This is a 
simple scheme to obtain a limited accuracy estimate 
of the number of responders S, i.e. the 
number of words with RR=1. The most useful values 
for this number are zero, one and more than 
one. A simple analogue to digital converter is 
used to extract this information from the ana- 
logue sum. 


III. EXAMPLES OF VECTOR OPERATIONS 


This section presents two examples of vector operations in order to illustrate the capabilities of the VASTOR array. In the first example vector addition is described. The second example deals with an associative search for the largest element of a vector. 

Let A and B be two vectors that are resident in the VASTOR array, Figure 6a. It is required to obtain a third vector R which represents the arithmetic sum of A and B. Information regarding the two vectors A and B is stored in a table in the controller. The table stores the relevant parameters for each vector, e.g. starting address in the array, number of elements, number of bits, etc. The ADD operation is initiated by the host computer by sending a high level command specifying the function to be performed and the two operands A and B. It is not necessary for the host computer to specify such details as the addresses of the operands, the number of elements or the element lengths. Operands are identified by means of pointers into the operand table stored in the controller. When the operation is completed, the controller returns to the host the value of the pointer corresponding to the result vector R. 

Addition is performed in a bit serial, word parallel manner. The sequence of operations is given in Figure 6b. As indicated in the figure, control of the sequence of operations and address calculations are performed in the controller, while vector operations are performed in the array. The optional masking operation at the beginning of the sequence disables those words of the array for which the mask contains "0"s. This may be needed when the vectors involved contain fewer elements than the number of VASTOR words. The mask used in such operations is set up at the time vectors A and B are created. 
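
The bit-serial, word-parallel character of this addition can be illustrated with a Python sketch. It is not the sequence of Figure 6b, which is not reproduced here; the function name, the per-word carry list, and the convention that masked-off words return `None` are all assumptions made for the illustration.

```python
def bit_serial_add(a_words, b_words, width, mask=None):
    """Add two vectors one bit position at a time, all words in step.

    Returns a (width + 1)-bit sum per word; words disabled by the mask give None.
    """
    n = len(a_words)
    mask = [1] * n if mask is None else mask
    carry = [0] * n                 # one carry bit held per word
    result = [0] * n
    for bit in range(width):        # controller loop over bit positions, LSB first
        for w in range(n):          # conceptually simultaneous in every word
            a = (a_words[w] >> bit) & 1
            b = (b_words[w] >> bit) & 1
            s = a ^ b ^ carry[w]
            carry[w] = (a & b) | (carry[w] & (a ^ b))
            result[w] |= s << bit
    for w in range(n):
        result[w] |= carry[w] << width   # final carry becomes the top bit
    return [r if mask[w] else None for w, r in enumerate(result)]


sums = bit_serial_add([3, 200, 15, 0], [5, 100, 1, 0], width=8, mask=[1, 1, 1, 0])
```

The outer loop runs once per bit of the operands regardless of how many words the array holds, which is why the execution time depends on word length rather than on vector length.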

An implementation of the binary search algorithm [3] for positive or unsigned integers is given in Figure 6c. In this case the elements of the vector are scanned starting with the MSB. A one-bit wide vector TEMP masks out the words that have been rejected at any stage of the search. The associative sum S is used to determine the first bit position where one element of TEMP contains a "1" while all other elements contain "0"s. At the end of the search TEMP contains "1"(s) in the word(s) containing the largest element(s). 
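
A possible reading of this search, written as a Python sketch, is given below. It is an interpretation of the prose rather than a transcription of Figure 6c: the function names are hypothetical, and the responder count is deliberately quantized to zero, one, or more than one, following the limited-accuracy estimate described in section 2.3.

```python
def responders(flags):
    """Limited-accuracy responder count: 0, 1, or 2 meaning 'more than one'."""
    return min(sum(flags), 2)


def associative_max(words, width):
    """Return the TEMP vector: 1 in every word holding the largest value."""
    temp = [1] * len(words)                  # all words start as candidates
    for bit in range(width - 1, -1, -1):     # scan from the MSB
        hits = [t & ((w >> bit) & 1) for t, w in zip(temp, words)]
        s = responders(hits)
        if s == 0:
            continue                         # no candidate has a 1 here; TEMP unchanged
        temp = hits                          # reject candidates with a 0 in this position
        if s == 1:
            break                            # a single responder is the unique maximum
    return temp


marks = associative_max([9, 14, 3, 14, 7], width=4)
single = associative_max([9, 14, 3], width=4)
```

When the maximum is unique the scan can stop as soon as a single responder remains; with ties, as in the first call, all bit positions are examined and every word holding the largest value stays marked.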

The above examples illustrate the operation 
of VASTOR on short vectors with all bits conti- 
guous in fields. When there are more elements 
in a vector than words in the array, the vector 
may be broken into several subvectors. Each 
subvector is operated on independently. It is 
also possible that the elements of a vector may 
occupy two or more non-contiguous fields in a
word. In this case the controller repeats the 
operations on the different fields of the vec- 
tor. 


IV. THE CONTROLLER 


The function of the controller is to reduce the 
control overhead required from the host machine 
to drive VASTOR. In order to keep the VASTOR
array continuously active, 50 control bits are
needed every microsecond. That is, a control
bandwidth of 50 bits/microsecond must be supported.
This rate exceeds the bandwidth of the
entire PDP-11 UNIBUS. Hence, it must be reduced
to a level which does not prevent the host from
performing operations not related to VASTOR.
The controller receives high level commands from
the host machine, requiring a much lower control
bandwidth. These commands are then translated
into the sequences of control signals needed to
drive the VASTOR array.
The complexity of the commands that have to 
be interpreted by the controller is represented 
by the examples given in section III. In order 
to support such operations, a hierarchical ap- 
proach has been adopted. Each level in the 
hierarchy serves to reduce the bandwidth required
from the higher levels. Furthermore, interpretation
of high level commands has been
made relatively simple because of the use of 
well defined interfaces between various levels. 
The hierarchical approach led to the controller
organization shown in Figure 7. It consists
of three distinct units: the microcontroller,
which performs low level looping control
operations; the buffer memory, which is used as a
communications medium; and the microprocessor,
which is responsible for interpreting high level
commands received from the host and for space
allocation within the VASTOR array. As such,
the microprocessor performs functions similar to
those of the "interpreter" in ECAM [1]. The
microcontroller corresponds to the iteration control
logic in ECAM. The three subsystems of
VASTOR's controller are discussed briefly below.


4.1 THE MICROCONTROLLER
The microcontroller UC serves to remove 
some of the redundancy at its output, the array 
control lines, in order to reduce the bandwidth 
required at its input. Its sophistication, and 
therefore cost, can be selected to provide al- 
most any desired bandwidth at its input. We 
have chosen to implement a device that executes 
sequences of microcode stored in an internal
Read Only Memory, with primitive branching and 
looping capability. Input commands to the mi- 
crocontroller come from a buffer memory M which, 
in turn, is filled by the microprocessor UP. 

Linear microcode sequencing provides a 
large reduction in the control bandwidth. Hence, 
it was adopted as the main sequencing mechanism 
in the microcontroller. The starting address 
for a given microcode sequence is loaded from 
the buffer M. Since data can be made to appear
in the VASTOR array in fields of consecutive
locations, further compression of the control
information is obtained with a simple loop
counter/index register. This counter is decremented
and tested to control microprogram loops. It
also serves as an index register to modify the
addresses transmitted by the controller to the
array memory.

Some further control bandwidth compression
is obtained by introducing a data-dependent
branch. The associative sum of responders is
compared to a reference in the microcode. One of
two branch addresses is then selected from the
buffer M.


4.2 THE BUFFER MEMORY 

The buffer memory is divided into sixteen 
separate task control blocks. These blocks are 
filled by the microprocessor and interpreted by 
the microcontroller. Whenever the microcontrol- 
ler finishes a task it interrupts the micropro- 
cessor to request the address of the next con- 
trol block. Task control blocks contain up to 
26 bytes of information. This includes starting 
and loop control information for the microcode 
of the microcontroller. It also includes speci- 
fications for the operands in the VASTOR array. 


4.3 THE MICROPROCESSOR


Controller algorithms represented by one 
control block in the buffer memory take from 1 
to several hundred microseconds to complete and 
to interrupt the microprocessor. These interrupts
are usually quite simple to service but
would be uneconomically frequent for the host 
machine. The microprocessor is therefore in- 
cluded to provide further compression of the 
control bandwidth. It simplifies the interfac- 
ing software by translating high-level opera- 
tions into sequences of microcontroller tasks. 
In addition to sequencing control, the mi- 
croprocessor performs the storage management 
function. This includes allocating and freeing 
fields of storage, garbage collection, paging 
variables into the working store from the backing
store, allowing the widths of elements (e.g.
integers) to expand and contract, and segmenting
vectors longer than the VASTOR array into
manageable components.


V.  INPUT/OUTPUT 


Data transfer between VASTOR and the host 
machine is generally difficult because of the 
incompatibility of the addressable units in the 
two machines. While a host machine generally 
obtains all bits of a single element of a vector 
with one reference to its memory, VASTOR obtains 
one bit of each element. The transposition re- 
quired to match the two machines is the source 
of the difficulty.

The simplest type of vector to transfer is
a boolean vector, which is only one bit wide,
figure 8a. In order to transfer such a vector
from the host into the VASTOR array, its elements
may be shifted serially by bit into the
shift register SH. This is followed by a transfer
from SH to a column of WK using the parallel
mode (mode 3, section 2.2). If elements of the
boolean vector are packed into bytes in the host
machine, as is the case in some versions of APL,
shift register SH may be loaded serially by byte
through its “A” port. In the current implementation,
data rates for the bit-serial and byte-serial
modes are 1 Mbit/s and 1 Mbyte/s respectively.

Consider now the case where data is presented
to VASTOR so that some number of consecutive
bits must be loaded into a single word,
figure 8b. This may be achieved by first loading
register RR of the ICU from the CONST line,
figure 4, and then storing the content of RR in
the enabled word. Due to that two-step sequence
and the fact that only one word is enabled at a
time, the transfer rate is limited to 500
Kbits/s.

The phrase structure may be used to increase
the transfer rate of byte-organized data,
as shown in figure 8c. This corresponds to mode
4 of section 2.2. The data rate achievable in
this case is 2.5 Mbits/s. In this approach consecutive
words from the host machine are not
loaded into consecutive words of VASTOR.
Rather, they are loaded into the same relative
positions in consecutive phrases. A sentence
structure consisting of two phrases per sentence
also exists and may be used for 16-bit wide I/O
transfers. The detailed procedure is given in
reference [6].
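
The transposition between the host's word-at-a-time view and VASTOR's bit-slice view, together with the effect of the quoted data rates, can be illustrated with a short sketch. The function names are hypothetical, and the transfer time estimate simply divides the number of bits by the rate; it ignores any set-up overhead, which the text does not quantify.

```python
def to_bit_slices(words, width):
    """Host layout -> array layout: slice b holds bit b of every element."""
    return [[(w >> b) & 1 for w in words] for b in range(width)]


def from_bit_slices(slices):
    """Array layout -> host layout (inverse of to_bit_slices)."""
    n = len(slices[0]) if slices else 0
    return [sum(slices[b][i] << b for b in range(len(slices))) for i in range(n)]


def transfer_time_us(n_elements, bits_per_element, rate_mbit_per_s):
    """Idealized transfer time in microseconds (1 Mbit/s = 1 bit per microsecond)."""
    return n_elements * bits_per_element / rate_mbit_per_s


print(to_bit_slices([1, 2, 3], width=2))
print(transfer_time_us(256, 16, 0.5), transfer_time_us(256, 16, 2.5))
```

Under these assumptions a 256-element, 16-bit vector takes about 8192 microseconds at 500 Kbits/s and about 1638 microseconds at 2.5 Mbits/s, both much longer than the vector operations themselves.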


VI. PERFORMANCE IN APPLICATION AREAS 


This section discusses potential applications
of a VASTOR processor. The primary application
of VASTOR is as an auxiliary processor in
a minicomputer system. In this case, it would
serve to enhance the performance of the system
in vector and associative operations. A second,
and equally important, potential application
derives from the fact that VASTOR can be regarded
as a collection of 1-bit wide controllers
driven in parallel by a host computer. Each of
these two application areas is discussed briefly
below.


Table 1. Performance Comparison Between VASTOR and a PDP-11/45
with Bipolar Memory in Vector Operations Involving
256-Element Vectors, with 16 Bits per Element.

Operation           Result   VASTOR Execution Time         PDP-11/45 Execution Time
                             (microseconds)                (microseconds)

Compare             Vector   4 us/bit * 16 bits = 64       3.225 us/word * 256 words = 825.6
Addition            Vector   10 us/bit * 16 bits = 160     1.9 us/word * 256 words = 486.4
Mark Largest        Vector   3 us/bit * 16 bits = 48       2.5 us/word * 256 words = 640
Element
Compare to Scalar   Vector   3 us/bit * 16 bits = 48       2.5 us/word * 256 words = 640
Sum Reduction       Scalar   336 us/bit * 16 bits = 5376   1.5 us/word * 256 words = 384

Vector and associative operations are performed
quite frequently in the operating system
software of a computer. Symbol table manipulation
and file management are two such examples.
Also, some computer languages, such as APL and
SNOBOL, are based upon the organization and
manipulation of data in the form of vectors [4] or
character strings [7]. A VASTOR processor is
ideally suited to such tasks, and hence can take
a considerable load off its host computer. Table
1 gives an estimate of VASTOR's performance
in this area. The table gives execution times
for a number of operations on 256-element vectors,
where each element is 16 bits wide. These
times are based on the current implementation
using a processing element, the ICU, which runs
at a 1 microsecond cycle time. For comparison,
the times required to perform the same operations
in a PDP-11/45 minicomputer are given. As
can be seen from the data in Table 1, VASTOR is
an order of magnitude faster than a PDP-11/45
when executing tasks that involve parallel operations
on all elements of a vector. However,
operations such as sum reduction (adding all
elements of a vector) take much more time. In
this case, VASTOR's performance is limited by
its inter-word communication facilities. However,
when dealing with much longer vectors
VASTOR's performance on sum reduction approaches
its performance on vector addition. This is due
to the fact that many elements of the vector
would be stored in the same word of the array.
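
The arithmetic behind Table 1 can be reproduced with a few lines. The per-bit and per-word costs are taken from the table (the 4 us/bit figure for Compare is inferred from its 64 us total); the dictionary layout and function names are assumptions of this sketch.

```python
BITS_PER_ELEMENT = 16
ELEMENTS = 256

# operation: (VASTOR microseconds per bit, PDP-11/45 microseconds per word)
COSTS = {
    "compare": (4, 3.225),
    "addition": (10, 1.9),
    "mark largest element": (3, 2.5),
    "compare to scalar": (3, 2.5),
    "sum reduction": (336, 1.5),
}


def times_us(op):
    """Return (VASTOR time, PDP-11/45 time) in microseconds for one operation."""
    per_bit, per_word = COSTS[op]
    return per_bit * BITS_PER_ELEMENT, per_word * ELEMENTS


def speedup(op):
    """PDP-11/45 time divided by VASTOR time; values below 1 mean VASTOR is slower."""
    vastor, pdp = times_us(op)
    return pdp / vastor


for op in COSTS:
    print(f"{op:22s} {times_us(op)}  speedup {speedup(op):.2f}")
```

The VASTOR column scales with the element width while the PDP-11/45 column scales with the number of elements, so the ratio improves as vectors grow up to the size of the array.
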
At the present stage of development of the 
VASTOR processor, it is very difficult to obtain 
an accurate estimate of the gain in performance 
that would result from adding a VASTOR processor 
to a minicomputer system. While the data in Ta- 
ble 1 indicate that considerable gain can be re- 
alized, this gain will be partially offset by 
the overhead resulting from transferring data
between VASTOR and its host computer. This ov- 
erhead is expected to be of the same order as 
that involved in transferring data between the 
main memory of a computer and a disk file.
Therefore, VASTOR is most suited for use in ap- 
plications where a number of vector operations 
have to be performed before a given vector is 
transferred back to the host machine. 


Stand-alone ICU’s have applications in process
control and monitoring. VASTOR may be used
in situations where a number of ICU’s performing
similar tasks are to be interfaced to a common
host computer. In this case, VASTOR represents 
an organized way of performing I/O and control 
functions. Each ICU is capable of sampling data 
from and controlling an external device at data 
rates of the order of a few kilohertz. Status 
information and data such as minimum values, 
maximum values, averages, setpoints and enabling 
bits for each device may be kept in the corres- 
ponding working storage. The main limitation to 
this approach is that it is necessary to syn- 
chronize data transfer between the ICU’s and the 
various devices. 


VII. CONCLUSIONS 


The VASTOR processor presented in this pa- 
per represents a trade-off between the capabili- 
ties and cost of the inter-word communication 
facilities in an associative processor. The re- 
sult of this trade-off is a processor that al- 
lows a nontrivial associative processing capa- 
bility to be incorporated in small scale mini- 
computer systems. The communication hardware 
provided in the VASTOR array enables data trans- 
fer among the words in the array without requir- 
ing costly and complicated hardware. It also 
results in simple backplane interconnections 
between different modules. The modular structure
of VASTOR allows its capabilities to be expanded
easily and economically.

Some of the limitations of the current implementation
of VASTOR are due to the slow speed
of the processing element used (the ICU). A
faster and more powerful 1-bit wide processing
element can lead to a considerable increase in
performance without the need for any changes to
the architecture. In fact, because of the low
number of interconnections involved, the structure
is well suited to integration. Some of the
possibilities would be the implementation of an
array of 1-bit processors, or processors and
memory, on a single chip. Another possibility
which is currently being investigated by the
authors is the use of a table driven processing
element made of memory only. Some other limitations
of VASTOR, such as the difficulty of
reordering a vector, are more fundamental. In
order to perform such operations at high speed,
a more complex, and hence more costly, inter-word
communication scheme must be provided.

Fig. 6a. Vector addition
Fig. 6b. Implementation of vector addition
Fig. 6c. Search for the largest element


Fig. 7. Controller hierarchy: host computer, microprocessor UP,
buffer memory M, microcontroller UC, and VASTOR array
(microinstructions reach the array at 50 bits/us)

Fig. 8. Alternative modes for input/output transfers
(shaded area: area loaded with 1 transfer)


An Outline of the Computer System with Associative Pipelining 


Simon Ya. Berkovich 


Department of Electrical Engineering and Computer Science 
The George Washington University 


Washington, D. 


Summary 


The fundamental ways for increasing the pro- 
ductivity of computer systems are parallelism and 
pipelining. In both cases for the sake of effi- 
ciency the computing processes should be decomposed 
into possibly small and uniform parts. The most 
appropriate elementary computing operations from 
this point of view are provided by a fully parallel
word-organized associative processor [1]. Unfortunately,
the successful application of associative
processors comes up against two limitations: the
implementation of such devices on a sufficiently
large scale is rather difficult, and the necessity
to make supplementary moves of data in and out of
the working area cuts down the gain from their fast
processing.


In this work we consider a new type of computer
system - dual to the associative processor.
Its main component is a homogeneous array of cells
[2], which realizes pipelining transformations in
space, isomorphic to parallel transformations
realized by the associative processor in time (fig.
1). The algorithms of associative processing
are based on the alternation of two types of
commands: (1) Φ - the isolation of the subset of
words having a given indicator and (2) A - the
multiwriting of given codes simultaneously in certain
digits of all the words of the isolated subset.
The program in (Φ-A) form for the processor
controls the pipeline elements as well. The data
are processed during transmission and the number
of the pipeline elements is equal to the length of
the program rather than to the amount of these
data. The above-mentioned limitations on the size
of the device and speed of the computing process
fall away, and it gives fresh impetus to the application
of the long and well developed theory of
associative processing.
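
The duality can be made concrete with a small sketch. It assumes a particular encoding that the summary does not specify: each program step is a tuple (select mask, select bits, write mask, write bits), where the first pair plays the role of the isolation command and the second pair the multiwrite command. All names are hypothetical.

```python
def apply_step(word, step):
    """Apply one (isolate, multiwrite) step to a single word."""
    sel_mask, sel_bits, wr_mask, wr_bits = step
    if (word & sel_mask) == sel_bits:            # isolation: word is in the subset
        word = (word & ~wr_mask) | wr_bits       # multiwrite: set the given digits
    return word


def associative_processor(words, program):
    """Parallel in space, sequential in time: each step acts on all words at once."""
    for step in program:
        words = [apply_step(w, step) for w in words]
    return words


def associative_pipeline(words, program):
    """One element per program step; words stream through, one element per tick."""
    k = len(program)
    if k == 0:
        return list(words)
    stages = [None] * k                          # stages[i]: word that just left element i
    out = []
    for incoming in list(words) + [None] * k:    # k extra ticks drain the pipeline
        if stages[-1] is not None:
            out.append(stages[-1])
        for i in range(k - 1, 0, -1):
            prev = stages[i - 1]
            stages[i] = None if prev is None else apply_step(prev, program[i])
        stages[0] = None if incoming is None else apply_step(incoming, program[0])
    return out


program = [(0b001, 0b001, 0b100, 0b100),   # odd words: set bit 2
           (0b100, 0b100, 0b001, 0b000)]   # words with bit 2 set: clear bit 0
print(associative_processor([1, 2, 3], program))
print(associative_pipeline([1, 2, 3], program))
```

Both functions give the same result; the pipeline needs as many elements as there are program steps, regardless of how many words are streamed through it.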


The principle of associative pipelining can be 
applied to different types of computers from rela- 
tively small specialized devices to very large data 
processing systems. The computing process can be 
constructed as a succession of the uniformly orga- 
nized data transmissions; if the program is longer 
than the available pipeline length, the processing 
can be arbitrarily divided into successive steps. 
A general purpose architecture is shown in fig. 2. 


The pivotal part of the computing system is
the associative pipeline in the form of a closed
curve to decrease possible losses due to fragmentation.
Information storage is spread over a number
of devices with cyclic access, which are
called DLS - "Drum-Like Storage," because a drum
presents a clear view of word stream supply. Main
functions of the control processor are the presentation
of control programs in (Φ-A) form and the
dynamic allocation of the DLS and pipeline
resources. The switching circuit establishes the
necessary paths between DLS and output units
through some segments of the pipeline and directing
interfaces. The control program can be sent to
such a path essentially simultaneously with the
data stream.


In the framework of this architecture it is 
simple to achieve multiprogramming facilities by an 
interleaving technique for data transmissions. The 
solution of the concurrency problems can be orga- 
nized in such a way that as soon as some informa- 
tion starts out to transfer from one DLS to another 
DLS, all the requests to the former should be re- 
assigned to the latter, and the access to the up- 
dated information will be available right away, 
before the whole process of updating will be com- 
pleted. 


Associative transformations of isolated words 
should be extended to some operations concerning 
their collective properties. These operations can 
be applied to sets of short words considered as 
long-word packets, and to data collection as a 
whole for sorting, eliminating duplicates, max/min 
and so on. It is easier to provide such facilities
for the associative pipeline than for the
associative processor, because the processor re- 
quires extra circuitry in bulk, while the pipeline 
needs only some additional equipment for its indi- 
vidual devices - directing interfaces and output 
units. 


The pipeline operations are efficient for manipulating
different types of information
structures, especially in table form. They may
be used in sublanguages based on relational algebra
such as SEQUEL. Associative pipelining is adjustable
for most reasonable table functions such as MAX, MIN,
COUNT, TOTAL and for transformation operators like
SELECTION, PROJECTION, DIVISION, and JOIN. The
computer system with associative pipelining is
beneficial for inverted file directories, which
can be organized by presenting the keys of records
in packet form. The access may be accelerated by
an order of magnitude or more. Associative
pipelining provides not only all necessary information
corresponding to simple key matching;
more complex searching criteria, including
logical functions and partial name matching, can
be accomplished in the same time.


The most crucial question for system applications
is the pipeline length, i.e., the number of
(Φ-A) elements to be implemented. Estimates show
that one such element with word length r of about
40 bits should contain approximately 0.5·10^3 gates.
A moderate system of about 10^5 logic circuits may
present a pipeline with ~200 elements. This is
quite enough for most information retrieval procedures,
for which algorithms with an O(r) number
of (Φ-A) elements are typical. A larger system on
the order of 10^6 logic circuits may present a pipeline
with ~2,000 elements. Such systems may be
used for computational problems for some kinds of
algorithms with an O(r^2) number of (Φ-A) elements.


The idea of associative pipelining is in accord
with the data-flow concept [3]. The advantages of this
approach are hardware/software uniformity, high
speed, ease of operational control and multi-access
using a communication computer. This system naturally
integrates into a network environment. It is
worthwhile to notice that reply processing can
be initiated before the completion of the request,
when it is in (Φ-A) form. Because any algorithm
can be realized at the rate of word transmission
with no bottleneck situations, associative pipelining
is appropriate for code conversion while sending
and receiving data, e.g. for any kind of
encryption and error correction, and for
more complicated real-time signal processing.

Fig. 1. ASSOCIATIVE PROCESSOR - PIPELINE DUALITY


Associative processors are known to be useful
for different parallel algorithms, but their most
powerful applications are in information retrieval.
The associative pipeline as a dual structure
has similar properties. Hence, this computer
system may in particular be considered as a
sort of database machine [4]. In this case, a
unified approach to different types of information
systems can be developed, including features of
information retrieval and database management.


The problem of communication between
processes in multiple processor systems is
addressed. Three high level communication
mechanisms are presented. The first
mechanism is based on sequential processes
consisting of modules, procedures and
processes that communicate via procedure
calls and input/output statements. The
second mechanism is based on message
passing consisting of modules with
conditional send message and receive
message primitives. The third mechanism
is based on structured message-passing
consisting of blocks that receive messages
at the beginning of a block and send
messages at the end of a block.
Programming language constructs for
supporting each of the three mechanisms
are outlined. The structured
message-passing approach (or abstract
dataflow approach) has features that
facilitate automatic scheduling of blocks
to processors, bring out all parallelism
at the block level, facilitate
synchronization without using semaphores,
and facilitate a design approach using
abstractions and refinement.


I. INTRODUCTION 


Most of the multiple processor systems
that have been developed during the past
several years can be divided into three
categories:


a. Tightly coupled systems such as 
multiprocessors with shared memory 
or a shared bus (e.g. C.mmp, UC
Berkeley's PRIME, Burroughs 
6700/7700, and PLURIBUS).


b. Loosely coupled systems such as 
distributed systems, and systems 
which communicate by passing 
messages (e.g. HP's 39898, IBM's 
8192). 

c. Networks of computers (e.g. 
ARPANET, ALOHANET, Ethernet). 

The development of each of the above
systems has required significant software
development and maintenance. Since
software is more expensive than hardware
or firmware, the implementation of the
strategies and policies used in the above
systems is not possible in inexpensive
multiple processor systems. In this
paper, an outline of strategies for
interprocess communication in multiple
processor systems is presented. A detailed
discussion is presented in [1,2].


STRATEGIES FOR INTERPROCESS COMMUNICATION 


A fundamental concept useful in loosely
coupled multiple processor systems is the
distributed process, dp. A dp is a
collection of blocks, called dp blocks.
The dp blocks communicate either by using
messages or by sharing data structures.
The details of dp blocks are discussed
later on. A dp consists of the following:

1. Its own address space. 

2. Its own resource environment. 

3. A list of all other dp's it can 
access (capability list) and a list 
of other dp's that can access it 
(access list). 

All communications between dp's take place
using relatively short messages. The code
and data associated with a dp are stored in
the address space. The address space of
any dp is symbolic (e.g. a collection of
named objects). The code of a dp is
executed using the data in the address
space and the resources available in the
environment. The resource environment of
a dp provides the runtime support for the
dp. It contains standard library
programs, a runtime stack(s) for
supporting activation records, a heap(s)
for supporting storage needs, a
processor(s), virtual or real devices, and
special functional units such as a
floating point processor and an FFT
processor.
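
The three components of a dp listed above can be pictured as a small record type. This is a hedged illustration only: the class name, the field names, the `inbox` list, and the rule that a message is delivered only when both the sender's capability list and the receiver's access list permit it are assumptions of the sketch, not constructs defined in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class DistributedProcess:
    name: str
    address_space: dict = field(default_factory=dict)    # symbolic: named objects
    resources: dict = field(default_factory=dict)        # stacks, heaps, devices, ...
    capability_list: set = field(default_factory=set)    # dp's this dp can access
    access_list: set = field(default_factory=set)        # dp's that can access this dp
    inbox: list = field(default_factory=list)            # received short messages


def send(sender, receiver, message):
    """Deliver a short message if both lists allow it; return True on delivery."""
    allowed = (receiver.name in sender.capability_list
               and sender.name in receiver.access_list)
    if allowed:
        receiver.inbox.append((sender.name, message))
    return allowed


reader = DistributedProcess("reader", capability_list={"writer"})
writer = DistributedProcess("writer", access_list={"reader"})
print(send(reader, writer, "job 1"), send(writer, reader, "ack"))
```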


REQUIREMENTS OF INTERPROCESS COMMUNICATION


One of the major areas to be addressed in 
the support for dp's is the interprocess 
communication. The following dp support 
requirements are noted: 


a. Ability to share large data
structures and to communicate short 
messages. 


b. Ability to block requests on 
various conditions.

c. Ability to refuse requests for 
resources. 

d. Facility to employ different 
strategies in accessing resources. 


e. Facility to handle local and system 
exception conditions. 
f. Facility to prevent deadlocks.


Asynchronously executing dp's can
communicate using three distinct
approaches. The first approach is based
on the procedure call or the use of
monitors [3]. In this approach, a program
is partitioned into processes by the
programmer. In each process, the
programmer makes decisions regarding the
sequence of statements. An OMODULE
construct is introduced to realize the dp
concept. The OMODULE consists of a
collection of PROCESSes, MODULEs, IMODULEs
for shared objects, DMODULEs for device or
control dependent activities, procedures,
an initialization part, and a module body.
The constructs MODULE, IMODULE, DMODULE,
PROCESS, and procedure represent the
dp block. Each MODULE consists of a
collection of procedures, MODULEs, an
initialization part, and a body. A MODULE
establishes a scope rule for its local
variables. An IMODULE is the extension of
the interface module in MODULA [4]. It
encapsulates shared objects and operations
allowed on these objects. An IMODULE
consists of a collection of procedures,
one or more DMODULEs, an initialization
part, and a body. Nesting of IMODULEs is
not allowed. A DMODULE is an extension of
device modules in MODULA. The syntax of
the above constructs and the informal
semantics of the constructs are shown in


oaks 

The second communication strategy is based
on message passing. A program is
partitioned into modules by the
programmer. In each module, the programmer
makes decisions regarding the sequence of
statements. Communication between modules
is by sending messages and receiving
messages. This approach to communication
avoids the delay inherent in procedure
calls when the called procedure cannot be
entered.


The third communication strategy is a 
structured message-passing approach based 
on dataflow with high level primitives 
bg 2 tr : The highlights of the third 
strategy are shown in the next section. 


STRUCTURED MESSAGE-PASSING


The structured message-passing approach to
communication between asynchronous
processes uses principles of dataflow [5].
Basic dataflow has been used in computer
systems organization and in the
specification of algorithms. All
activities that can be performed in
parallel are expressed as nodes without
data dependencies. Activities are
sequenced only when there is a data
dependency. One problem with using basic
dataflow is that the resulting graphs are
complex, containing all the details.
Another problem is the rather restricted
set of primitives. Basic dataflow has
been extended so that users can define
operations suitable to their applications.
Each of these operations is a procedure in
a high level language. This abstract
dataflow approach is supported by a
dataflow simulator [6,7]. Dataflow
programs using basic dataflow primitives
and user defined operations can be run on
the dataflow simulator. This has opened
up several possibilities in analyzing
algorithms for parallelism and functional
partitioning of programs.


Each node in abstract dataflow waits for
the arrival of tokens on the required 
input arcs. If space is available on the 
output arcs for tokens, then the node can 
be enabled for firing. Thus, communication 
between nodes is accomplished using tokens 
on arcs. Tokens can be thought of as 
messages and specified arcs as data paths. 
This communication facility is different 
from the message passing approach in 
several aspects: 


a. Using a standard firing rule, once
a node starts firing, it cannot be
interrupted by other nodes sending
tokens to it and the node cannot
wait for tokens from other nodes.

b. Using a nonstandard firing rule, a
node starts firing when tokens
arrive on a specified subset of
input arcs and continues to accept
token(s) on a specified subset of
input arcs.

c. All tokens generated by a node are
sent as output on the designated
arcs either at the end of firing,
if a standard firing rule is used,
or during firing if a nonstandard
firing rule is used.

There are several advantages to the above 
mentioned communication facility: 


a. The communication mechanism for 
each node is the same. 

b. Each node has a specified set of
output arcs for sending tokens to 
other nodes and a specified set of 
input arcs for receiving tokens 
from other nodes. 


c. The communication structure is
regular and comprehensible. The
resulting program is well
structured.

d. Synchronization is achieved by
using enabling conditions which
require at least one token on each
of the required input arcs. There
is no need for semaphore variables
and P and V operations on
semaphores.
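
The standard firing rule described above (tokens on all required input arcs, space on the output arcs, results sent at the end of firing) can be sketched as follows. The `Arc` and `Node` classes and their method names are hypothetical; they illustrate the enabling condition and are not the constructs proposed in the paper.

```python
from collections import deque


class Arc:
    """A data path holding a bounded queue of tokens."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.tokens = deque()

    def has_token(self):
        return len(self.tokens) > 0

    def has_space(self):
        return len(self.tokens) < self.capacity


class Node:
    """A node with a standard firing rule; operation returns one value per output arc."""

    def __init__(self, operation, inputs, outputs):
        self.operation = operation
        self.inputs = inputs
        self.outputs = outputs

    def enabled(self):
        return (all(arc.has_token() for arc in self.inputs)
                and all(arc.has_space() for arc in self.outputs))

    def fire(self):
        if not self.enabled():
            return False                      # synchronization without semaphores
        args = [arc.tokens.popleft() for arc in self.inputs]
        results = self.operation(*args)
        for arc, value in zip(self.outputs, results):
            arc.tokens.append(value)          # all output sent at the end of firing
        return True


a, b, c = Arc(), Arc(), Arc()
adder = Node(lambda x, y: (x + y,), [a, b], [c])
a.tokens.append(2)
print(adder.fire())        # not enabled: arc b has no token yet
b.tokens.append(3)
print(adder.fire(), list(c.tokens))
```
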

In the structured message passing
approach, a program is an abstract
dataflow graph (ADG). Each ADG is a
labelled and directed graph which is an
interconnection of subgraphs. Each
subgraph consists of nodes which are
interconnected by arcs. Each node and arc
has a number of attributes. Some of the
attributes of a node are label, operation,
and input/output (I/O) arc specification.
The label of a node is a unique identifier
for the node. The operation attribute
specifies the semantics associated with
the node. The I/O arc specification
attribute specifies the input arcs and
output arcs of the node. Each I/O arc
specification of a node has a set of
conditions that must be met before the
node can be fired. This
set of conditions is called the enabling
condition and is represented as a set,
called the firing semantics set (FSS).
Some of the attributes of an arc are
label, token type that the arc can carry,
and arc capacity.


A simple example is shown in Figure 1.
The READER reads a job, copies it into an
empty buffer, and outputs a filled buffer
to PROCESS JOB. The PROCESS JOB
manipulates and fills an empty buffer with
its results. The buffer received from
READER is returned as an empty buffer to
the BUFFER POOL MANAGER. The WRITER
receives the filled buffer from PROCESS JOB,
outputs the contents of the buffer, and
returns the empty buffer to the
BUFFER POOL MANAGER. The node
BUFFER POOL MANAGER in turn removes an
empty buffer from one of its input arcs
and outputs the empty buffer on one of its
output arcs using the policy specified in
the semantics of BUFFER POOL MANAGER.
Initially, empty buffers are on the arc
EMPTY BUFFERS.


A detailed discussion on ADG, extensions
to ADG, and several examples are shown in
[9]. In order to use the structured
message-passing approach in multiple
processor systems, we need either an
environment supporting the execution of
ADGs or a high level language with
constructs for representing nodes, arcs,
and tokens. In this paper, the high level
language approach is pursued. We propose
a construct for representing nodes. This
construct is called a dp block. We now
draw an analogy between the ADG and the dp
concept, and describe the details of
dp blocks.

If we treat a dp as analogous to the
execution of a subgraph in an ADG, the
dp blocks are analogous to the nodes in an
ADG. Since any number of nodes can be
fired in parallel depending on the FSS and
the availability of data, any number of
dp blocks in a dp can potentially be
executed simultaneously. The syntax of
this construct is shown in Table 1.

The name of the dp block corresponds to
the name of the operation assigned to a
node. The label of the dp block
corresponds to the node label. The input
arc descriptions correspond to the
description of all arcs incident to the
node. An arc description consists of arc
name, arc capacity, label of the source
block, and the type of token the arc can
carry. The condition part corresponds to
an FSS and specifies the set of input arc
names that must have at least one token in
order to enable the node for firing.
Those arcs that can receive a token(s)
during the firing of the node are
specified in INPUT. The condition part
can be a Pascal IF statement using any of
the input arc names or constants, or
logical operators such as AND, OR, or NOT.
If the condition part is absent, then a
token must be present on each of the
input arcs in order to execute the block
and the block should not contain any arc
names in INPUT. The body of the dp block
represents the node semantics. The
constant, type, and local variable
declarations have the usual Pascal syntax
and semantics. The statements in the
block can be any of the executable Pascal
statements except procedure calls and
dp blocks.


Assignment statements in a dp block use
the single assignment rule, which is
similar to those rules proposed by Tessler
and Chamberlin [19].

The result of a node's firing is provided
by the last line in the block. If values
have been calculated in the block for the
output arc names, then these values are
sent as output by the last line in the
block provided the condition part is true.
Outputs can also be sent during the node
firing. The output arcs that receive
tokens in this manner are shown in OUTPUT.


PROGRAM DECOMPOSITION

Since a program is a directed and labelled
ADG using nodes and arcs with an operation
assigned to each node, this operation can
be the name of the ADG. Such a node is
called a recursion node, and the graph is
called a recursive graph. Each recursive
graph is denoted as a dp. Each invocation
of a recursive graph has a distinct graph
color. This color is used by all the
tokens in the invocation of the graph.
All the invocations of a recursive graph
can reside in one dp address space, use
the dp environment, and have the
capabilities of the dp. Each invocation
can also use a distinct copy of the dp's
address space, environment, and
capabilities.

There is a special kind of operation,
called APPLY, that can be assigned to a
node [6,7]. This operation builds a graph
dynamically using tokens on input arcs.
Each APPLY node is denoted as a dp. A
nonrecursive graph can be partitioned into
subgraphs using cluster detection
algorithms [11] or by computing cut sets
[12,13], satisfying a given objective
function. The arcs in the cut set
represent the data paths for communication
between subgraphs. Each subgraph is
denoted as a dp. If a graph contains
nodes representing recursive subgraphs or
nodes with the operation APPLY, then each
such node is denoted as a dp. Nodes that
have not been denoted as dps are analyzed
in the above manner to identify all dps.
In other words, program decomposition can
be performed algorithmically.

Table 1. Syntax of dp_block


<dp_block> ::=
    <label>: BEGIN <name> (<input_arc_descriptions>)
        {<condition>}
        {INPUT <arc_descriptions>}
        {<constant_declaration>}
        {<type_declaration>}
        {<local_variable_declaration>}
        {<statement>}
        {OUTPUT <arc_descriptions>}
    (<output_arc_descriptions>)
    {<condition>}

<arc_descriptions> ::=
    {(<arc_name>, <arc_capacity>, <label_of_source_dp_block>,
      <token_type>, <arc_mode>)}

<statement> ::=
    <Any Pascal statement other than procedure calls, or dp blocks>

<constant_declaration> ::= <constant declaration in Pascal>

<type_declaration> ::= <Type declaration in Pascal>

<local_variable_declaration> ::= <variable declaration in Pascal>




SUITABILITY OF BUBBLE MEMORIES IN PARALLEL
PROCESSOR ARCHITECTURES

Summary


Advances in computer architecture result from 
creative organizational ideas, improvement and 
innovation in components, and the requirements of 
new applications. Architectural creativity has 
led to various parallel processor organizations. 
Technological inventiveness has produced magnetic 
bubble memories. When bubble technology was a new 
item in the research labs, the major anticipated 
application was mass memory for large computer 
systems [5]. Now that commercial bubble devices 
are available, the applications have in reality 
been microprocessor oriented [9]. Certain prop- 
erties of bubble storage devices make them quite 
suitable as components in a memory hierarchy for 
parallel processors. Five aspects of this suit- 
ability are outlined in this paper. 


Parallel Processor Memory Considerations: A 
parallel processor follows the definition that 
there is a single control unit with the responsi- 
bility of driving a set of identical processors. 
These machines have primary memories from which 
processing elements (PEs) operate. Secondary 
memory is typically disk storage interfaced to 
primary memory, the control unit, or even a host 
computer. Clearly this definition includes real 
machines such as ILLIAC IV [1], PEPE [8], and 
STARAN [2]. It also includes newer ideas such as 
data base machines [4,7]. 


There are two aspects of the memory systems 
that need to be mentioned. First, the amount of 
primary memory per processor is typically much 
smaller in parallel processors than in uniproces- 
sors. For example, ILLIAC IV has a 2K word by 64 
bits memory associated with each processing ele- 
ment, PEPE has 1K by 32 bits per PE, and STARAN 
with its more global, multi-dimensional access 
memory has a 256 by 256 bit memory array associa- 
ted with 256 PEs or 256 x 9216 in the STARAN E. 
These numbers reflect memory component technology 
available at the time the machines were built. 
They also illustrate the comparatively small pri- 
mary memory size used in parallel processors. 
Movement of data in and out of primary memory is 
an important part of total system performance. 
The second aspect is secondary memory and its 
interface to primary. The usual device is a disk. 
This provides good storage capacity but is poor 
with respect to access time and interface path. 


Bubble Memory Characteristics: Magnetic bub- 
ble memory technology became commercially avail- 
able in the late 1970's. An introductory refer- 
ence is [9]. Bubble memories are essentially 
shift register storage structures. Several shift 
path organizations are possible. One of these, 
the major-minor loop organization, offers a good
compromise between capacity, access time, and sim-
plicity of operation. Figure 1 shows the basic
structure. Information flows serially in or out 
of the device on the major loop but can be trans- 
ferred in parallel as a "page" to or from the 
minor loops. 


Given a major-minor loop organization, the 
access time components for a read are (a) posi- 
tion the page at the transfer gates, and (b) 
transfer to the major loop and shift to the 
detector. Presently available devices have page 
position times of 3 to 10 msec and major loop 
shift times of 4 to 30 msec. At the device level, 
capacities range from 92K bits to 1M bits. Expec- 
tations are that device capacity will double annu- 
ally and that new techniques will improve shift 
rates by a factor of ten in the near term.


Support circuits are needed to implement 
memory systems. These circuits include a con- 
troller which provides an interface to other 
equipment through useful features and functions 
such as page buffering, format conversion between 
bits and bytes, maintaining page position, error 
detection and correction, and indicating status. 
The controller exercises control over individual 
bubble devices attached to it. Functions acti- 
vated at one device are independent of other 
devices. That is, bubble memory devices are indi- 
vidually operable. 


Bubble memories clearly represent a new 
choice for designers. The properties and charac- 
teristics of this new choice need to be examined. 


Five Suitability Factors: The five subsec- 
tions that follow are intended to provide a con- 
trast between bubble and disk devices when used 
in parallel processor memory hierarchies. Figure 
2 shows the model for bubble memory usage as an 
intermediate level in the hierarchy. 


(1) Access Time. With the small primary memory 
capacity in typical parallel processors, fast 
access to a secondary storage is important. 
Access to randomly located information in cur- 
rently available bubble memories is about ten 
times faster than access times to random informa- 
tion using movable head disks. Fixed head disks 
match bubble access times but are not competitive 
in terms of cost or modularity. 


Storage allocation techniques for reducing 
access times are applicable to both. However, 
bubble memories have a performance improvement 
due to their unique capability to "stop" the 
rotation. Pages can be positioned at the trans-
fer gate waiting for an I/O command.


(2) Selectable Input/Output. The ability of an 
individual processor within a parallel processing 
system to execute instructions sent from the com- 
mon control unit, or else do nothing, represents 
one form of local, individualized control. This 
control exists because it is useful, or even 
essential, for devising parallel algorithms. In 
previous architectures, the ability to enable 
local entities applied only to processors and the 
primary memory associated with them. Control of 
a bubble memory system is readily exercised at the 
level of individual components. By using such 
memory components, local control can extend to 
secondary memory. This is a logical extension of 
the need for local control which becomes practi- 
cal through bubble memory technology. 


(3) Localized Memory Addressability. In most 
parallel organizations an additional local control 
feature is memory reference modification. 
Addresses supplied to all processors from the con- 
trol unit can be modified individually within the 
processors. It is this feature that enables 
simultaneous access to rows or columns of arrays
through skewed storage schemes [6]. It is a 
variation of this feature that produces multi- 
dimensional access memories [3]. In previous 
machines, local control of addresses was limited 
to the primary memory. Now, with device level 
control of bubble storage devices it is possible 
to extend local addressability to the secondary 
memory.


(4) Customized Configurations. Bubble memory
components are ideally suited to customized design
of secondary memory configurations. Modular com-
ponents allow memory design customized to the num-
ber of PEs and capacity requirements. For exam-
ple, the number of modules can match exactly the
number of PEs for bit stream operations. It can
be a multiple for byte wide or other size I/O
operations. If I/O transfer rates are less
demanding, a controller can operate more than one
memory module. Essentially, the technology allows
a great deal of flexibility.


(5) Fault Tolerance. First note that non- 
volatile bubble memories are manufactured using 
integrated circuit techniques. There are no 
moving parts or mechanical adjustments. The 
devices are inherently more reliable than disk 
storage units. As further protection against 
failures, storage loops can be provided for single 
error correction and double error detection. When 
an uncorrectable failure occurs, the system 
remains operable with reduced parallelism. It is 
reasonable to assume the fault can be located to a 
replaceable unit providing a minimal mean time to 
repair.

Conclusion: Research is needed to develop
a better understanding of the data structures and
algorithms for efficient use of bubble memories in
parallel processing environments.


References 


[1] G. Barnes et al, "The ILLIAC IV Computer,"
    IEEE T. C., (Aug. 1968), pp. 746-757.

[2] K. E. Batcher, "STARAN Parallel Processor
    System Hardware," Proc. of the NCC, (1974),
    pp. 405-410.

[3] K. E. Batcher, "The Multidimensional Access
    Memory in STARAN," IEEE T. C., (Feb. 1977),
    pp. 174-177.

[4] P. B. Berra and E. Oliver, "The Role of Asso-
    ciative Array Processors in Data Base
    Machine Architecture," Computer, (March,
    1979), pp. 53-61.

[5] P. I. Bonyhard et al, "Applications of
    Bubble Devices," IEEE Trans. on Magnetics,
    (Sept. 1970), pp. 447-458.

[6] P. Budnik and D. Kuck, "The Organization and
    Use of Parallel Memories," IEEE T. C.,
    (Dec. 1971), pp. 1566-1569.

[7] G. A. Champine, "Current Trends in Data Base
    Systems," Computer, (May 1979), pp. 27-41.

[8] A. J. Evensen and J. L. Troy, "Introduction
    to the Architecture of a 288 Element PEPE,"
    Proc. of the 1973 Sagamore Computer Confer-
    ence, (1973), pp. 162-169.

[9] J. E. Juliussen, D. M. Lee, and G. M. Cox,
    "Bubbles Appearing First as Microprocessor
    Mass Storage," Electronics, (Aug. 4, 1977),
    pp. 81-86.


Figure 1. Major-Minor Loop Bubble Memory Organization
(labels: Detect (output), Generate (input), Transfer, Minor Loops)

Figure 2. Parallel Processor with Distributed Secondary Memory
(label: Control Unit)

ON THE PERFORMANCE OF ON-LINE ARITHMETIC

Miloš D. Ercegovac and Aksenti L. Grnarov

UCLA Department of Computer Science
University of California,
Los Angeles, California 90024


Abstract -- An analysis of the performance and
effectiveness of on-line arithmetic structures is
provided. A relative comparison with structures
based on the conventional arithmetic in computational
problems such as the evaluation of scalar and vector
expressions and recurrence systems indicates speedup
and cost benefits of on-line arithmetic structures.


1. Introduction 


The purpose of this research is to analyze the
performance of on-line arithmetic structures and
provide a relative comparison with the conventional
arithmetic in computational problems such as the
evaluation of scalar and vector expressions and
recurrence systems. On-line arithmetic algorithms
have been investigated by a number of authors [1-6].
Here we review only the basic definitions and
characteristics that are used in the following
discussion.


An algorithm is on-line if the j-th leftmost output
digit is computed using no more than $(j+\delta)$
leftmost input digits. Thus, an on-line algorithm is
always performed in a digit-serial manner from the
most to the least significant digit. In order to
compute the first digit of the result, the inputs
have to be known to $\delta+1$ digits of precision.
Thereafter the next most significant digit of the
result can be obtained for each additional input
digit. The on-line delay $\delta$ is a small integer,
typically 1 to 5 for the basic arithmetic operations.
The algorithms for addition, subtraction,
multiplication and square root with $\delta=1$ have
been described in the literature. The on-line
division algorithms require $\delta=3$ to 4 depending
on the radix [1,4,5]. Interestingly, there is an
algorithm for fast polynomial and rational function
evaluation with an on-line delay of -1 [2].


The use of a redundant number system in the
representation of the variables is necessary and
desirable in on-line arithmetic. Computation of
results from left to right in all operations requires
the use of a redundant number system in the
representation of the results. Consequently, the
input operands should also be acceptable in the
redundant form. A redundant number system can be
used conveniently in the on-line algorithms. The
time required to compute one output digit, $t_d$,
can be made independent of the length of operands by
using internally a redundant representation of the
partial results. Alternatively, an internal
carry-save structure can achieve the same effect.

The on-line representation of a number x is defined
as

    $X_j = X_{j-1} + x_{j+\delta}\, r^{-j-\delta}$

The digits $x_i$ belong to a redundant digit set

    $\{-\rho, \ldots, -1, 0, 1, \ldots, \rho\}$

where $r/2 \le \rho \le r-1$ determines the amount
of redundancy.
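
As a small illustration of what redundancy means here, the following Python sketch evaluates radix-r fractions whose digits may be negative and shows two different digit strings with the same value. The function names are hypothetical, and the digit strings are treated as plain fractions 0.d1 d2 ..., which is an assumption made for the example rather than the paper's exact indexing.

```python
from fractions import Fraction

def signed_digit_value(digits, r):
    """Value of the fraction 0.d1 d2 ... in radix r, digits may be negative."""
    return sum(Fraction(d, r ** (i + 1)) for i, d in enumerate(digits))

def is_redundant_digit_set(r, rho):
    """Check the condition r/2 <= rho <= r-1 on the largest digit magnitude."""
    return Fraction(r, 2) <= rho <= r - 1

# Radix 2 with digits {-1, 0, 1}: two representations of one quarter.
a = signed_digit_value([1, -1], 2)   # 1/2 - 1/4
b = signed_digit_value([0, 1], 2)    # 1/4
```

Because the same value has several representations, an early output digit can be chosen from partial information and corrected by later digits, which is what left-to-right computation relies on.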


In general, an on-line algorithm is specified
recursively in terms of the on-line representations
of operands, results and some internal values. The
recursion is of the form

    $A_j = F(A_{j-1}, x_{j+\delta}, y_{j+\delta}, z_j)$

where $A_j$ denotes the internal vectors required by
the algorithm. For example, in the case of
multiplication $A_j$ contains the scaled residual
$W_j$ and the on-line representations of the operands
$X_j$ and $Y_j$. In general, the internal vectors at
the j-th step require j radix r digits in the
representation. The primitive operations used in the
recursion are addition, multiplication by a single
radix r digit, one position shift and concatenation.
The output digit is determined by a limited precision
selection function [1,2,4,5,7,8]:

    $z_j = S(\hat{A}_{j-1}, x_{j+\delta}, y_{j+\delta})$

where $\hat{A}$ is a truncated value of A. Since only
a small number of most significant digits is required
for the selection of the output digit, the recursion
can be performed using totally parallel operations,
i.e., carry-propagation limited operations. Thus the
recursion step time or the time $t_d$ to obtain one
output digit is independent of the length of the
operands and an on-line algorithm can be implemented
in a highly modular manner without speed degradation.
An organization of on-line unit as a linear array of
identical modules operating in parallel is shown in
Figure 1.

The number of modules is determined by the precision
s of the selection function $S(A_j)$ and the number
of digits n:

    $p = \lceil (n + s)/2d \rceil$

assuming that each module has internally d digits of
precision. Detailed descriptions of modular
organizations of on-line units are discussed in
[2,3,5].

The on-line algorithms are interesting for several
reasons. Since the results are always computed from
left to right, a sequence of operations can be sped
up by overlapping the operations at the digit level.
Furthermore, the interconnections in an on-line
arithmetic network are much simpler than in a
conventional one since only single digits are
transferred between the operation units. Therefore,
the structures using on-line arithmetic can be
implemented in a highly modular manner. The on-line
arithmetic realizes by definition a
variable-precision arithmetic with a built-in
significance indication: for the inputs of k
significant digits the output has at least
$k-\delta$ significant digits.


The on-line algorithms can be used in a
floating-point system without difficulties. The
exponent arithmetic should be implemented using a
conventional approach. One apparent advantage of the
on-line floating-point arithmetic is in the operand
alignment phase. It can be performed in on-line
manner and, thus, overlapped with the mantissa
operation. However, in the present discussion we are
assuming that, given the same resources, the
floating-point exponent operations, operand alignment
and mantissa normalization require the same time in
on-line and conventional arithmetic. Therefore, our
analysis of relative performance of these two
approaches is restricted to mantissa operations.


We first consider the performance of on-line and
conventional arithmetic unit structures (networks)
in evaluating scalar expressions. In this case we
are interested in the total delays of networks, their
costs and effects of the interconnection bandwidth on
the speedup. The arithmetic units, on-line as well
as conventional, are not pipelined. Later we discuss
the relative performance and cost-effectiveness of
on-line and conventional networks of pipelined units
in evaluating vector expressions, i.e., scalar
expressions repeated on sets of operands.

2. Evaluation of Scalar Expressions 


We consider a scalar expression to be of the form

    $z = E(x)$

where z is a scalar, x is an argument vector of
n-digit elements and E is an arithmetic expression
formed with the floating-point operators
{+, -, *, /, square root} and the elements of x.

Assume that a network to evaluate E consists of
non-pipelined arithmetic units, connected as a tree
network of L levels. In the case of conventional
arithmetic we assume that the units at the i-th level
begin operating only when all units at the level i-1
have finished. In an on-line network the units are
synchronized with a common digit clock. An on-line
unit at level i can generate the first output digit
as soon as the corresponding $\delta+1$ input digits
are available. Therefore, a network of L levels of
on-line arithmetic units has the delay (latency):

    $T_{OL} = \left( n + \sum_{i=1}^{L} (\delta_{i\max} + 1) \right) t_d$

where $\delta_{i\max}$ is the largest on-line delay
at the i-th level, n is the number of digits and
$t_d$ is the time to compute or load one digit.


Similarly, a network of L levels of conventional
arithmetic units has the following delay:

    $T_{CON} = \sum_{i=1}^{L} (T_{i\max} + t_{LOAD})$

where $T_{i\max}$ is the time of the slowest
operation unit at the level i and $t_{LOAD}$ is the
time to transfer operands between two levels in the
network.


We assume in our analysis that $\delta_{i\max}$ is 3
on the average. In the case of conventional
arithmetic units, we assume that

    $T_{i\max} = c\, n\, t_d$

where c = 1 if the conventional arithmetic operation
time is $T = O(n)$ and $c = (\log n)^2/n$ if the
operation time is $T = O(\log^2 n)$.


The on-line and the conventional networks, consisting
of the same number of units, are compared using the
speedup factor S:

    $S = \frac{T_{CON}}{T_{OL}} = \frac{L(cn + 1)}{n + 4L}$

assuming that $t_{LOAD} = t_d$. The minimum number
of levels for which an on-line network is faster than
a conventional network is

    $L_{\min} = \left\lceil \frac{n}{cn - 3} \right\rceil$

For example, let n=32 and c=25/32. Then a network
with two or more levels is faster in on-line
arithmetic than in the conventional arithmetic. For
large L, $S \to (cn+1)/4$. In particular,

    $\frac{\log^2 n + 1}{4} \le S(\infty) \le \frac{n + 1}{4}$


The number of levels required to achieve k percent
of the maximum speedup is:

    $L = \frac{kn}{4(1 - k)}$

with k expressed as a fraction. The relation between
the speedup S and the number of levels L is
illustrated for n=32 and c=25/32 in Figure 2.
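
The scalar-network formulas can be checked numerically. The sketch below is a minimal Python model, assuming the same on-line delay (3 by default) at every level, $t_{LOAD} = t_d = 1$, and $T_{i\max} = c\,n\,t_d$; the function names are illustrative and not from the paper.

```python
import math
from fractions import Fraction

def online_delay(n, deltas, t_d=1):
    """T_OL = (n + sum(delta_i + 1)) * t_d, one delta per level."""
    return (n + sum(d + 1 for d in deltas)) * t_d

def conventional_delay(n, c, levels, t_d=1, t_load=1):
    """T_CON = L * (c*n*t_d + t_LOAD), taking T_imax = c*n*t_d at every level."""
    return levels * (c * n * t_d + t_load)

def speedup(n, c, levels, delta=3):
    """S = T_CON / T_OL with the same delta assumed at every level."""
    return Fraction(conventional_delay(n, c, levels)) / online_delay(n, [delta] * levels)

def min_levels(n, c, delta=3):
    """Smallest integer L with S > 1, i.e. L > n / (c*n - delta); needs c*n > delta."""
    return math.floor(Fraction(n) / (c * n - delta)) + 1

c = Fraction(25, 32)              # (log2 32)^2 / 32
s1 = speedup(32, c, 1)            # one level: on-line is slower
s2 = speedup(32, c, 2)            # two levels: on-line is faster
s8 = speedup(32, c, 8)
s_limit = (c * 32 + 1) / 4        # limit of S for large L
l_min = min_levels(32, c)
```

For n=32 and c=25/32 the model gives a speedup above 1 from two levels onward, and eight levels reach half of the limiting speedup, in agreement with $L = kn/4(1-k)$ for k = 1/2.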
The minimum number of digits for which an on-line
network is faster than the conventional one is:

    $n_{\min} = \left\lceil \frac{3L}{cL - 1} \right\rceil \approx \left\lceil \frac{3}{c} \right\rceil$


In the previous analysis the difference in the
bandwidth requirements of on-line and conventional
networks was ignored. If a conventional arithmetic
unit has a bandwidth of B digits per variable, its
delay is increased to:

    $T_A = (n/B + cn)\, t_d$

and the speedup becomes

    $S = \frac{L(n + cBn)}{B(n + 4L)}$

The additional speedup, due to the bandwidth
limitation, is n/4B for large L.
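
A short Python sketch of the bandwidth-limited speedup follows, under the same assumptions as the scalar analysis (on-line delay 3 at every level, so the denominator is n + 4L). The function name is hypothetical.

```python
def speedup_with_bandwidth(n, c, levels, B, delta=3):
    """S = L(n + c*B*n) / (B(n + (delta+1)L)) for a bandwidth of B digits per variable."""
    return levels * (n + c * B * n) / (B * (n + (delta + 1) * levels))

full = speedup_with_bandwidth(32, 25 / 32, 2, B=32)    # all 32 digits moved at once
narrow = speedup_with_bandwidth(32, 25 / 32, 2, B=8)   # 8 digits per transfer
```

With B = n the operand transfer costs one digit time and the expression reduces to the earlier L(cn + 1)/(n + 4L); a narrower path lengthens the conventional network's delay and therefore raises the relative speedup of the on-line network.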


3. Organization of On-Line Structures 


A pipelined on-line unit consists of $(n+\delta)$
stages with the stage delay $t_d$. In the steady
state, the unit is computing up to n different
results, the first stage producing the first digit of
the i-th result and the last stage producing the last
digit of the (i-n)-th result. As mentioned before,
to implement the recursion of an on-line algorithm,
the working precision that increases with the number
of steps must be provided. If the result is to be
computed to a maximum precision of n digits, the
recursion requires at the j-th step a precision of j
digits for $j \le n/2$ and a precision of n-j digits
for $j > n/2$. Therefore, n simultaneous operations
in various stages of completion require a total
working precision of about $n^2/4$ digits.
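
The $n^2/4$ estimate can be confirmed by summing the per-step precision directly. The Python sketch below assumes one operation at each of the n steps; the function name is illustrative.

```python
def working_precision(n):
    """Total digits needed when one operation sits at each step j = 1..n."""
    # Step j needs j digits while j <= n/2, and n - j digits afterwards.
    return sum(j if 2 * j <= n else n - j for j in range(1, n + 1))

total_32 = working_precision(32)
total_64 = working_precision(64)
```

For even n the sum equals $n^2/4$ exactly, for example 256 digits for n = 32.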


This indicates that a one-dimensional array of
modules, shown in Figure 1, would not be suitable for
pipelining since the modules (their internal
precision) and the inter-module bandwidth would
depend on the relative position in the array. We
suggest a two-dimensional array that uses identical
modules as illustrated in Figure 3. This array, if
implemented with d-digit wide modules, requires
$\lceil n/d \rceil$ rows with a variable number of
modules per row with the maximum number of modules in
a row as indicated above. The total number of
d-digit modules for a maximum precision of n digits
is approximately $(n/d)^2/4$. In terms of the digit
circuits, pipelined one-dimensional and array units
have equivalent complexities, the latter scheme
having more uniform implementation.

4. Evaluation of Vector Expressions 


Consider vector expressions that have V vector
operands and one vector result, each of M elements:

    $Z = E(X_1, \ldots, X_V)$

    $Z = (z_1, \ldots, z_M)$

and

    $X_i = (x_{i1}, \ldots, x_{iM})$

Each vector element is represented with n significant
digits.


A conventional pipelined unifunctional unit is
assumed to have N stages with the stage delay $t_s$
[9]. The time required to compute M results using a
network of L levels of pipelined conventional units,
shown in Figure 4, is:

    $T_{COP} = (NL + M - 1)\, t_s$

In this analysis we are ignoring the time required to
"chain" pipelined units.

A pipelined on-line unit of array type discussed in
the previous section has $n+\delta$ stages for a
precision of n digits. The stage delay is $t_d$.
The time required to compute M results using a
network of L levels of pipelined on-line units, shown
in Figure 5, is:

    $T_{OLP} = (L\delta_{\max} + n + M - 1)\, t_d$


We are assuming that the latencies of a conventional
and an on-line pipelined unit satisfy the following
condition:

    $N t_s = c\, n\, t_d$

where c is defined in Section 2. The speedup factor
in this case is:

    $S_P = \frac{T_{COP}}{T_{OLP}} = \frac{cn(LN + M - 1)}{N(L\delta_{\max} + n + M - 1)}$

The speedup factor for several cases in which L=4 is
given below:

    c       n    N    M      S_P
    25/32   32   4    100    4.9
    1       32   4    100    6.2
    1       32   8    1000   3.9
    1       64   4    1000   15.0
For a large number of operands, i.e., when
$M \to \infty$, the speedup is:

    $S_P = cn/N$

and in the case of networks with a large number of
levels L, i.e., when $L \to \infty$:

    $S_P = cn/4$

These results indicate that the additional speedup
due to on-line arithmetic is between 2 and 16 for
typical precision.
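
The tabulated speedups can be recomputed from the formula for $S_P$. The Python sketch below assumes $\delta_{\max} = 4$, a value chosen here because it matches both the table and the large-L limit cn/4; the paper does not state it explicitly next to the table, and the function name is illustrative.

```python
def pipelined_speedup(c, n, N, M, L, delta_max=4):
    """S_P = c*n*(L*N + M - 1) / (N*(L*delta_max + n + M - 1))."""
    return c * n * (L * N + M - 1) / (N * (L * delta_max + n + M - 1))

# (c, n, N, M, tabulated S_P) for L = 4
cases = [
    (25 / 32, 32, 4, 100, 4.9),
    (1, 32, 4, 100, 6.2),
    (1, 32, 8, 1000, 3.9),
    (1, 64, 4, 1000, 15.0),
]
computed = [pipelined_speedup(c, n, N, M, L=4) for c, n, N, M, _ in cases]
```

Under this assumption the computed values agree with the table to within about 0.1 in every row.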


One distinct advantage of on-line arithmetic is that
it can be easily applied in cases that are known to
be difficult to speed up using pipeline or parallel
computer organizations. For example, non-linear
recurrences [10,11] cannot be sped up by algebraic
transformations and thus a parallel or a pipeline
system organization is not useful. Consider an m-th
order non-linear recurrence

    $X(i) = F(X(i-1), \ldots, X(i-m))$

for $1 \le i \le M$ where F requires L levels of
operations. Using a network of pipelined on-line
units, F can be evaluated in time

    $\left( M \sum_{i=1}^{L} \delta_i + n \right) t_d$

In the case of conventional arithmetic: 


For example, a non-linear recurrence to compute the
square root of y

    $x(i+1) = \frac{1}{2}\left[ x(i) + \frac{y}{x(i)} \right]$

requires k iterations in order to obtain n digits of
precision. If implemented using conventional
arithmetic units, the time would be

    $T_{CON} = k\,(T_{DIV} + T_{ADD})$

and

    $T_{OLP} = k\,(\delta_{DIV} + \delta_{ADD})\, t_d + n\, t_d$

in on-line arithmetic.
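
The square-root recurrence and the two timing expressions can be sketched in Python as follows. The numeric choices are assumptions made only for illustration: $\delta_{DIV} = 4$ and $\delta_{ADD} = 1$ (within the ranges quoted in the Introduction), $T_{DIV} = T_{ADD} = n\,t_d$ (the case c = 1), and k = 5 iterations; the function names are hypothetical.

```python
def newton_sqrt(y, x0, k):
    """Iterate x(i+1) = (x(i) + y / x(i)) / 2 for k steps."""
    x = x0
    for _ in range(k):
        x = 0.5 * (x + y / x)
    return x

def conventional_time(k, t_div, t_add):
    """T_CON = k * (T_DIV + T_ADD)."""
    return k * (t_div + t_add)

def online_time(k, n, delta_div, delta_add, t_d=1):
    """T_OLP = k * (delta_DIV + delta_ADD) * t_d + n * t_d."""
    return k * (delta_div + delta_add) * t_d + n * t_d

root = newton_sqrt(2.0, 1.0, 6)
t_con = conventional_time(k=5, t_div=32, t_add=32)
t_olp = online_time(k=5, n=32, delta_div=4, delta_add=1)
```

Each iteration depends on the previous one, so the conventional time grows with the full operation time per iteration, while the on-line time grows only with the on-line delays plus one final pass of n digits.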
5. Cost Considerations 


The implementation costs of conventional and on-line
networks consisting of $N_U$ arithmetic units are
compared with respect to the total cost of arithmetic
modules and the costs of data communications in the
network. The cost of a conventional network is
defined as:

    $C_{CON} = C_{CU} N_U + (n \log_2 r)\, C_K N_K$

where $C_{CU}$ is the cost of a conventional
arithmetic unit; $C_K$ is the total communication
cost per bit; and $N_K$ is the number of data paths
in the network.

Similarly, the cost of an on-line network $C_{OL}$ is
defined as:

    $C_{OL} = C_{OU} N_U + (\log_2 r + 1)\, C_K N_K$

assuming one signed radix r digit per data path.

If the number of modules required to implement a
conventional arithmetic unit with the cost $C_{CM}$
per module is at least linearly proportional to the
number of digits, i.e.,

    $C_{CU} = n\, C_{CM}$

we obtain that

    $\frac{C_{CU}}{C_{OU}} = 2\, \frac{C_{CM}}{C_{OM}}$

since the number of modules in an on-line,
non-pipelined unit is proportional to n/2. Let

    $\lambda = \frac{C_{OM}}{C_{CM}}$

The ratio of implementation costs can now be
expressed in the following form:

    $R = \frac{C_{CON}}{C_{OL}} = \frac{1 + R_K x}{\lambda/(2G) + x}$

where

    $R_K = \frac{H \log_2 r}{\log_2 r + 1}$

is the communication cost ratio and x is defined as

    $x = \frac{(\log_2 r + 1)\, N_K C_K}{n F N_U C_{CM}}$

where G, H and F are implementation-dependent
parameters. We estimate [12] that for non-pipelined
units G=1, H=n and F=1, while for pipelined units
G=2c, H=1 and $\lambda=c$, assuming a stage delay of
$t_d$ units.


The cost ratio R indicates, for example, that the
sufficient condition for an on-line, non-pipelined
network to be less costly than the conventional one
is that the cost of the on-line module is no more
than twice the cost of the conventional module.
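
The cost comparison can be checked by computing both network costs directly from their definitions and comparing with the closed-form ratio. The Python sketch below assumes the non-pipelined parameter values G=1, H=n, F=1; the function names and the sample numbers are purely illustrative.

```python
import math

def network_costs(n, r, N_U, N_K, C_CM, C_OM, C_K):
    """Return (C_CON, C_OL) with C_CU = n*C_CM and C_OU = (n/2)*C_OM."""
    bits = math.log2(r)
    c_con = n * C_CM * N_U + n * bits * C_K * N_K
    c_ol = (n / 2) * C_OM * N_U + (bits + 1) * C_K * N_K
    return c_con, c_ol

def cost_ratio(n, r, N_U, N_K, C_CM, C_OM, C_K):
    """R = (1 + R_K*x) / (lambda/2 + x) for non-pipelined units (G=1, H=n, F=1)."""
    lam = C_OM / C_CM
    bits = math.log2(r)
    x = (bits + 1) * N_K * C_K / (n * N_U * C_CM)
    r_k = n * bits / (bits + 1)
    return (1 + r_k * x) / (lam / 2 + x)

# On-line module twice as expensive as a conventional module (lambda = 2).
params = dict(n=32, r=2, N_U=7, N_K=6, C_CM=1, C_OM=2, C_K=1)
c_con, c_ol = network_costs(**params)
ratio = cost_ratio(**params)
```

With the on-line module costing twice the conventional module, the arithmetic costs of the two networks are equal and the on-line network is cheaper overall because its data paths carry one digit instead of n digits.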


6. Concluding Remarks 


On-line arithmetic offers an alternative
approach in achieving higher speed in
numeric computations. On-line arithmetic
is complementary to other approaches that
are used to achieve concurrency in execution
of algorithms: for example, it can be
used in minimal-depth tree-structured networks.
In particular, the use of on-line
arithmetic in non-linear recurrence systems
would be advantageous. The main
features of on-line networks are (a) high
modularity and (b) simple interconnection
requirements. These properties make on-line
arithmetic very attractive in reconfigurable
networks. Importantly, the on-line
structures are easily extendable to
accommodate either more levels or higher
precision. Thus it is interesting to compare
the on-line arithmetic networks with
the conventional ones. The results of
this study indicate that by using on-line
arithmetic, besides highly reduced communication
requirements and modular, uniform
implementation, one can expect an additional
speedup factor of 2-16.
Abstract -- In this paper we use a simple
graph model to describe the routing algorithms 
for a class of circuit switching networks in inter- 
processor communication including permutation 
and full processor communication networks. In 
full processor communication systems, processors 
communicate with each other in arbitrary pairs 
as opposed to pairs between two disjoint sets of 
processors in a permutation network. It is also 
assumed that the pool of processors is hierarchically
structured and the minimum connection paths
for local (and/or global) connection are desired. 
We propose such a full processor communication 
network with optimized connection paths. Both 
size and routing complexities are shown to be 
O(N log N). 


I. Introduction 


The interconnection network is an essential 
part of a multiple-processor system and has been 
widely investigated as a means of interprocessor 
communications. These networks are generally 
classified as non-blocking, rearrangeable, or 
blocking in terms of their flexibility in inter- 
connection. A special class of the interconnec- 
tion networks is the multi-stage organization. 
This kind of organization has appeared widely in the
literature [CLOS 53, BENE 65, WAKS 68, OPFE 71,
STON 72, FENG 74, BATC 76, SIEG 78, WU 78,
NASS 79, etc.]. Research problems associated
with multi-stage interconnection networks include
system topology, connectivity, control structure
(routing), fault tolerance, and cost-effectiveness
of the system. In an SIMD or an MIMD environment 
two major interconnection schemes that are of 
interest are permutation networks and partition 
networks. A permutation network performs specific 
one-to-one connections between two disjoint sets 
of processors while a partition network partitions 
a set of processors into disjoint subsets such 
that the processors within each subset can communicate
with each other. A special case of partitioning
in which a set of processors is partitioned
into pairs of processors will be referred to as 
full processor communication throughout the paper. 
This kind of full processor communication can be 
achieved by extending some existing permutation 
networks. In this paper we will discuss some 
proposed full processor communication networks 

and then present an interconnection network for
full processor communication with optimized local
connections, i.e., a network in which the pool of
processors is hierarchically structured and the
minimum connection paths for local (and/or global)
communications are obtainable. The complexities
of routing and switching elements in the network
are discussed.


In Section II we review a bipartite graph 
routing algorithm for permutation networks. The 
algorithm can also be applied to the routing 
control in the full processor communication 
models. Section III discusses existing full
processor communication networks and proposes
a network with a localized property. The routing
for such a network is developed. A formal
description of the routing is given
in Section IV.


II. The Bipartite Graph Routing Algorithm 


A general structure of multi-stage networks 
which allow complete permutation of a set of 
processors is shown in Figure 1. This NxN
(where $N=2^n$) network is recursively defined and
$P_i$ denotes a complete permutation network of
size ($2^i \times 2^i$). Each (2x2) switching element may
assume one of the three states as indicated in
Figure 2. This network is a special case of the
general Clos network [CLOS 53]. Its structure
covers networks such as base line, omega, and
indirect binary n-cube networks since it has been
shown that these networks are topologically
equivalent [WU 78]. It is also shown by Clos
[CLOS 53] that this network can realize all N!
permutations of the N inputs. The argument is
usually made by induction using Hall's theorem
[HALL 35] although it can be illustrated easily
in the bipartite graph algorithm. Connections 
between processors (routing) can be established 
by some local addressing schemes [WU 78] or by 
a centralized routing control [OPFE 71]. Exist- 
ing routing algorithms for realizing any permuta- 
tion using only two states, straight and cross, 
have been shown to have the complexity of 
O(N log N) if the algorithm is implemented in a 
single processor system. 


The bipartite graph algorithm is demonstrated
as follows. The structure of the network
in Figure 1 is recursively defined with $O(\log_2 N)$
stages. Figure 3 shows an example of such a
network with N=8. This is essentially an 8x8 Benes
network. For a given permutation (or connections),
the routing is to determine the switch settings
of the entire network such that the desired
connections can be achieved. If we set the switches
iteratively from the outer stages into the inner
stages, we observe that after each iteration the
network is divided into two independent subnetworks.
This property leads us to a simple conclusion,
i.e., in order to connect two processors
(one from the left hand side, one from the right
hand side), they must be switched either both to
the upper subnetwork or both to the lower subnetwork.
This is the basic idea behind the graph
algorithm.

Figure 1: A Base-2 (NxN) Multistage Permutation Network ($N=2^n$).
Figure 2: Three States of the (2x2) Switching Element.
Figure 3: An 8x8 Benes Network.
Figure 4: Coloring of Graph for Switch Setting in Figure 3 Network.

The graph algorithm is illustrated
by the following example of permutation


    ( 0 1 2 3 4 5 6 7 )
    ( 3 7 4 0 2 6 1 5 )


Figure 4-a is a bipartite
graph that represents this permutation. The
symbols '<' or '>' denote a switch and the
lines across the two sets of numbers (processors)
denote the desired connections. A mark on a
switch indicates that the corresponding processor
will be switched down (or up) while the other
processor on the same switch should be switched
up (or down). The whole graph is marked such
that each pair (two numbers linked by a line) are
both marked or both unmarked. This process ensures
that the two processors in each connected
pair will always go to the same sub-permutation-network
in the next stage. It is obvious that we
can rephrase the marking process by saying that
the paths in the graph are marked alternately.
The same process is repeated as shown in Figure
4-b and Figure 4-c for the subnetworks. It takes
log N iterations to complete the marking of the
graphs and therefore the switch settings of the
entire network. The result is shown with dashed
lines in Figure 3.
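
The alternating marking for one iteration can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: terminals 2k and 2k+1 are assumed to share a (2x2) switch on each side, the function name `mark_outer_stage` is hypothetical, and the permutation used is an arbitrary 8-terminal example. The sketch walks each path of the graph and alternates the mark, so that two terminals on the same switch always go to different subnetworks.

```python
def mark_outer_stage(perm):
    """Assign each connection left -> perm[left] to subnetwork 0 (upper) or 1 (lower).

    Assumption: terminals 2k and 2k+1 share a switch on both sides.
    """
    n = len(perm)
    inv = [0] * n
    for left, right in enumerate(perm):
        inv[right] = left
    side = [None] * n  # side[left] = subnetwork used by the connection from `left`
    for start in range(n):
        if side[start] is not None:
            continue
        left, mark = start, 0
        while side[left] is None:
            side[left] = mark
            # The connection ending on the same right-hand switch takes the other subnetwork.
            other = inv[perm[left] ^ 1]
            side[other] = 1 - mark
            # The terminal sharing the left-hand switch with `other` takes `mark` again.
            left = other ^ 1
    return side


perm = [3, 7, 4, 0, 2, 6, 1, 5]  # illustrative permutation, chosen for this sketch
side = mark_outer_stage(perm)
print(side)
```

Applying the same function to the two sub-permutations that result gives the markings for the inner stages.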


There is a non-conflict marking for every
permutation graph since there are an even number
of paths and the marking is done alternately.
After log N iterations we will always obtain
$2^{\log N - 1}$ (2x2) subgraphs and still maintain the
desired connections. It is always possible to
realize a (2x2) subgraph by a (2x2) switch. Thus
all N! permutations are realizable by using the
interconnection network in Figure 1. The routing
algorithm using the graph model requires the
traversal of the graph and is of complexity O(N).
There are a total of log N stages in the network.
The overall complexity for setting the switches
for any permutation is therefore O(N log N).
Since subgraphs are independent, parallel processors
may be assigned for computing the routing.
$2^{i-1}$ processors may be used to set the
switches at the ith iteration. Thus a parallel
algorithm would require the following computations:

    $N + \frac{N}{2} + \frac{N}{4} + \cdots + \frac{N}{2^{\log N - 1}}$

For large N, this would have an upper bound of
2N. We reduce the time complexity from O(N log N)
to O(N) if parallel processors are available.
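
The bound can be checked numerically. The short sketch below is illustrative only: it assumes N is a power of two and counts one unit of work per terminal visited, and the function names are hypothetical.

```python
def serial_steps(n_terminals):
    """Work for one processor: N terminals visited in each of log2(N) iterations."""
    iterations = n_terminals.bit_length() - 1  # log2(N) for a power of two
    return n_terminals * iterations


def parallel_steps(n_terminals):
    """Time when iteration i uses 2**(i-1) processors, each on N / 2**(i-1) terminals."""
    iterations = n_terminals.bit_length() - 1
    return sum(n_terminals // 2 ** (i - 1) for i in range(1, iterations + 1))


for n in (8, 64, 1024):
    print(n, serial_steps(n), parallel_steps(n), 2 * n)
```

For every power of two the parallel count stays below 2N, in line with the O(N) claim.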


III. The Full Processor Communication Models 

In full processor communication systems, 
processors communicate with each other in arbi- 
trary pairs as opposed to pairs between two dis- 
joint sets of processors in a permutation network. 
Full processor communication can be achieved by
including the loop-back state of the (2x2)
switching elements or by using additional two-way
(straight and cross states) switches in a conventional
binary switching network. Several full
processor communication models are presented in
this section. Subsection A describes a non-blocking
network using three-state switching
elements with complexity $O(N^2)$. Subsection B
introduces a blocking interconnection network with 
complexity O(N log N). Finally in subsection C, 
we present a rearrangeable model with optimal 
connections and of complexity O(N log N). 


A. A non-blocking network using three-state 
switching elements 


By using all three states of the (2x2) switching
elements as shown in Figure 2, Gecsei [GECS 77]
shows a non-blocking full processor communication
system with $O(N^2)$ switching elements. A typical
eight-processor network is shown in Figure 5 in
which pairs of processors (0 7) (1 6) (2 5) (3 4) are
to be connected. For N processors the number of
possible ways of connection is (N-1)x(N-3)x...x3x1. The
total number of switches required is
$(N-2) + (N-4) + \cdots + 2 = \frac{N^2}{4} - \frac{N}{2}$.
The non-blocking property of the network can be
easily shown by induction.

Figure 6: A Sixteen Processors Communication Network.
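
Both counts are easy to confirm for small N. The following sketch is illustrative, with hypothetical function names; it compares the product formula against a brute-force enumeration of pairings and the switch sum against its closed form.

```python
def count_pairings(n):
    """(N-1) x (N-3) x ... x 3 x 1 for an even number n of processors."""
    result = 1
    for k in range(n - 1, 0, -2):
        result *= k
    return result


def enumerate_pairings(items):
    """Yield every way of splitting `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in enumerate_pairings(remaining):
            yield [(first, partner)] + tail


def switch_count(n):
    """(N-2) + (N-4) + ... + 2."""
    return sum(range(2, n - 1, 2))


print(count_pairings(8), switch_count(8))
```

For N=8 this gives 105 possible pairings and 12 switches, and 12 equals 64/4 - 8/2.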


B. A blocking network using three-state switch- 
ing elements 


The non-blocking network previously described
becomes impractical for large N since it
has a size complexity of $O(N^2)$ and an average
delay of O(N). A better solution must be sought.
By incorporating the loop-back state in the (2x2)
switching element, it becomes possible to connect
a pair of processors on the same side of the
multi-stage permutation network. The permutation
network thus becomes a full communication network
of 2N processors. Figure 6 is an example of an 8x8
sixteen-processor network for the connection of
{(1 2) (4 5) (0 15) (3 7) (6 12) (8 14) (10 13)
(9 11)}. Since the network is hierarchically
structured we can introduce the concept of local
connections, e.g. connections (0 1) (2 3) (4 5) (6 7)
are considered as the first level local connections,
(0 3) (4 6) as the second level local connections,
and (1 5) (3 6) as the third level local
connections, etc. Connections between the two
sides of the network are considered as long distance
connections. If we use binary numbers to
name the processors, then the levels can be
represented in bit positions. Assume that lower
level local connections are more likely to occur
than the long distance connections. It is therefore
desirable to have minimum delays for the
local connections such that the overall performance
of the routing delays can be improved. With
some modifications the graph algorithm presented
in Section II can be used for the routing control
in the full processor communication systems. It
is illustrated in the following example. Figure
7 is a connection graph similar to that of the
permutation network. Again the '<' or '>'


denotes a switch, and the curved lines and the straight
lines represent local connections and long distance
connections respectively. There are two
unconnected sub-graphs in the graph. Both sub-graphs
have an odd number of paths. An alternating
marking is possible only if the number of paths
is even in a sub-graph. The sub-graph formed by
the connection (4 5) is called a minimum sub-graph
since it has the minimum number of paths possible
in a graph. Such a graph implies an immediate
loopback since the alternating marking is
impossible. This loopbacked switch can be utilized
by other sub-graphs.

Figure 7: Connection Graph for Connection of Sixteen Processors.

A non-minimum sub-graph with an even number
of paths can be marked as usual. If it contains
an odd number of paths, some rearrangements have to
be made. This rearrangement must utilize the
loopback switch, if available, to make an
even-path graph such that an alternating marking is
possible. Figure 8 shows such an example of
marking.

Figure 8: Rearrangement of Connections.

The path between 0 and 15 is deleted
and connections of (0 4) and (5 15) are established.
This rearrangement makes an even-path
subgraph and yet maintains the connection of
(0 15) because the (4 5) connection is a loopback
and the output of the switch is shorted. The 
marking is done for the first iteration (Figure 8) 
and the outside layer of switches is set accord- 
ingly (Figure 6). The second and third iterations 
of the marking process are shown in Figure 9 where 
a, b, c are the common connection points due to 
loopback switches. The final switch setting
(dashed lines in Figure 6) shows that all local
connections have the shortest paths at the ex- 
pense of the prolonged delay of the long distance 
connection (0 15) which has a delay of 9 switches. 


It can be seen from the above graph example 
that a connection is realizable only if all sub- 
graphs have an even number of paths in each itera- 
tion or all non-minimum subgraphs with odd number 
of paths can be made into even paths subgraphs by 
combining with minimum subgraphs. Although the 
network has a complexity of O(N log N) and has 
local connection property, it is a blocking net- 
work. Figure 10 is an example of connections that
cannot be realized. The connection {(0 8)
(1 11) (2 12) (3 13) (4 14) (5 6) (7 15) (9 10)}
involves two sub-graphs with an odd number of paths
and no minimum subgraph can be utilized. In
this example the loop back of the local connection
(9 10) forces the two terminals 8 and 11 to both
go to the upper subnetwork or the lower subnetwork.
The two terminals 0 and 1 to be connected
with 8 and 11 can only go to different subnetworks.
Thus one of the connections cannot be
made unless there is a minimum subgraph that
provides a loopback for the connection.


C. A rearrangeable network using two-state
switching elements


Both non-blocking and blocking full processor
communication networks presented earlier use
three-state switching elements. The former network
requires $O(N^2)$ switching elements while the
latter has a size complexity of O(N log N). We
now propose an O(N log N) full processor communication
network which has optimal local connections
and requires only two-state switching
elements. Figure 11 is a sixteen-input modified
reverse-exchange network. The sixteen outputs
on the right hand side are shorted to form the
connections {(0 8) (1 9) (2 10) (3 11) (4 12)
(5 13) (6 14) (7 15)}. It can be seen that four
switches in the lower right corner are redundant.
The network now consists of two parts: a
partition network on the left and a permutation
network on the right. The partition network
partitions the input processors such that half of
the processors go to the upper part of the
permutation network and the other half is sent to
the lower part of the permutation network. For
any desired full processor communication, e.g.
the eight sets of connections (0 15) (1 4) (2 7)
(3 14) (5 9) (6 11) (8 10) (12 13) of sixteen
terminals, if we can partition the terminals
into two sets such that the two terminals in all
the connection pairs appear on different sides
of the permutation network, then it becomes
possible to achieve the desired connections. We
will show by using the graph model that the
partition network in the box shown in Figure 11
will accomplish such a partition. The graph in
Figure 12 represents the desired connections.
To obtain a connection, the two terminals linked
by a curved line should be dispatched to two
different sides of the permutation network, e.g.
for the pair (2 7), if terminal 2 is labeled left (l)
then terminal 7 should be labeled right (r). We
can traverse the graph marking the terminals l or
r without considering the curved lines as paths.
Since we have an even number of straight paths, a
partition of the terminals into two sets is always
possible. The labeling of the graph in Figure 12
gives us the two sets (0 2 4 6 9 10 12 14) and
(1 3 5 7 8 11 13 15). By using these two sets
as both sides of the permutation network we can
achieve the permutation

    (  0  2  4  6  9 10 12 14 )
    ( 15  7  1 11  5  8 13  3 )

by using the graph algorithm and therefore establish
the connections. The result is shown in Figure 11.

Figure 11: A Modified Reverse Exchange Network for Full Processor Communication
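
The labeling step can be reproduced with a short Python sketch. It is an illustration rather than the authors' procedure: it assumes that terminals 2k and 2k+1 share an input switch of the partition network and that each unlabeled path is started with label l at its lowest-numbered terminal. The function name `partition_terminals` is hypothetical.

```python
def partition_terminals(pairs, n):
    """Split terminals 0..n-1 into left/right sets.

    Two terminals of a requested pair, and two terminals on the same input
    switch (assumed to be 2k and 2k+1), must land on different sides.
    """
    partner = {}
    for a, b in pairs:
        partner[a] = b
        partner[b] = a
    label = {}
    for start in range(n):
        if start in label:
            continue
        terminal = start
        while terminal not in label:
            label[terminal] = "l"
            mate = terminal ^ 1          # other terminal on the same switch
            label[mate] = "r"
            terminal = partner[mate]     # must sit opposite its partner `mate`
    left = sorted(t for t, side in label.items() if side == "l")
    right = sorted(t for t, side in label.items() if side == "r")
    return left, right


pairs = [(0, 15), (1, 4), (2, 7), (3, 14), (5, 9), (6, 11), (8, 10), (12, 13)]
left, right = partition_terminals(pairs, 16)
print(left)
print(right)
```

With the eight connections of the example, the sketch yields the two sets (0 2 4 6 9 10 12 14) and (1 3 5 7 8 11 13 15) quoted in the text.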


The full communication network is similar
to the one proposed by Chung and Wong [CHUN 79].
However the interconnection network is centralized
in the sense that all terminals are considered as
a single group and all connections have the same
delay (the number of switches traversed). We
are interested in the concept of local connections,
i.e., processors are hierarchically structured and
local connections are expected to have minimum
delays. We have shown that the network in
Figure 11 can be used to get all possible connections.
By structuring the network we can
obtain an equivalent network with local proper- 
ties. To illustrate the restructuring we use 
the same network in Figure 11. The network is 
first unfolded (turn the bottom half to the right 
hand side) in order to make the picture clearer. 
Then we merge the redundant switches and rearrange 
the switch boxes. The network is then converted 
into the following (Figure 13). 


The twisted switch boxes in the center stage
essentially serve the purpose of straight-through
long distance connections or loopback local
connections. Such additional switches can be
used in other stages to provide an immediate
loop-back local connection without affecting the other
routing in the whole network as indicated by dashed
lines in the figure. The structure of the network
indicates that these modifications can be grouped
in pairs as a standard form shown in Figure 14.
Two-state switches S1 and S2 are additional
switches for loop-back purposes. There are four
inputs (a, b, c, d) to each pair of switches. To
perform loop-backs we have six possible connections:
(a b), (a c), (a d), (b c), (b d), (c d).
Connections (a b) (c d) can be ruled out because
in the graph algorithm they would have been
routed to different center stages. Connections
(a c) (b d) are not necessary because they come
from the same subnetwork and if local connections
are desired they would have been looped back in
the previous stages. Thus, if a local loop back
is desired, it should be either (a d) or (b c).
In other words, if we use binary code for each
input, a loop back is performed only if the pair
of processors differ in the two least significant
bits. Switches S1 and S2 in Figure 14 have the
ability of looping back (a d) and (b c) in
straight-state and preserving original connections
when in cross-state.
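
The bit condition can be illustrated as follows. The sketch assumes the four inputs a, b, c, d carry the two-bit codes 00, 01, 10, 11 in that order, which is an assumption made here for illustration, and reads "differ in the two least significant bits" as "both bits differ". The names `INPUTS` and `may_loop_back` are hypothetical.

```python
from itertools import combinations

# Assumed two-bit codes of the four inputs to a pair of switches.
INPUTS = {"a": 0b00, "b": 0b01, "c": 0b10, "d": 0b11}


def may_loop_back(x, y):
    """True when the codes x and y differ in both of the two least significant bits."""
    return ((x ^ y) & 0b11) == 0b11


allowed = [
    (p, q)
    for p, q in combinations(INPUTS, 2)
    if may_loop_back(INPUTS[p], INPUTS[q])
]
print(allowed)
```

Of the six candidate pairs only (a d) and (b c) pass the test, matching the argument above.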


The overall algorithm for full processor 
communication is summarized as follows: 


a. partition the terminals into two dis- 
joint sets by using the graph algorithm 


b. connect the two sets of terminals by 
using the permutation graph algorithm 
iteratively 


c. restructure the network and set switches 
for routing local connections 


The complexity of the overall process is O(N log N).
The process establishes all connections with
shortest possible routes for both local and long
distance connections with twice the number of
switches as in a non-localized network. The
number of switches remains O(N log N). The making
of the localized full communication network
is more formally described in the following
section.


IV. A Formal Description of Routing in the Network 


We have shown that the interconnection net- 
work in Figure 13 can be used for permutation and 
full processor communication. We may formally 
define our switching schemes as follows: 

Let the processors be labeled 0 to $2^n - 1$.

For each processor, a, define its binary
expansion as:

    $a = a_n a_{n-1} \ldots a_2 a_1$.

Number the stages of the switching network as
1, 2, ..., n-1, n, n-1, ..., 2, 1.
In all the networks below we define a connection
and switching procedure for which a is
switched to

    location $a_n a_{n-1} \ldots a_2 a_1$ at input to stage 1
    location $a_n a_{n-1} \ldots a_3 x_1 a_1$ at input to stage 2
    location $a_n a_{n-1} \ldots a_4 x_2 x_1 a_1$ at input to stage 3
    ...
    location $a_n x_{n-2} \ldots x_1 a_1$ at input to stage n-1

where the $x_i$'s are determined by the switch
positions.


A. Permutation Network 


For permutation the nth stage is redundant. 
Here we may map any a to b if a, = BY (0=1, T=0). 


Figure 12: Graph for Partitioning 16 Terminals.

Figure 13: A Restructured Interconnection Network with Local Properties.

Figure 14: A Standard Circuit Which Provides Loopback and Preserves Other Connections.


The (n-1)th stage consists of $2^{n-2}$ switches each of
which corresponds to the middle n-2 digits of the
input, $x_{n-2} \ldots x_1$. The 4 inputs are determined
by $a_n a_1$ (i.e. the first and last bit). The
switch is one of type which will map a, a,_7 to 
ay: y = 0 orl. 
Thus if our routing algorithm maps a to b then
a, = b, and 


if a maps to ax, _5-.-X,4) 4 


and b maps to ee 
Phen Xpco = Mncee Ages” neg = 1 = 


B. Full Processor Communication Without Local 
Connection 


The inputs to the nth stage are X,_4X,_o-.- 
XoX1a,. The nth stage consists of on-2 
each of which corresponds to the front n-2 
digits of the input Xie pXn-2°° Xe" The four in- 


Switches 


puts to the center stage switches are determined 
by the digits X1a,° The switch has the mappings 


mAh 
where a maps to XxX, _4...X)a, 
and b maps to ee es yyb, 


then Xan Xo7Vo and X1=Y) 


C. Full Processor Communication With Optimized 
Local Connection 


Two inputs a and b are local at stage k if 
a,=b,-- Ce ae and ay # by Let a and b be 


local at stage k, and suppose we want to connect 
them. Follow the full connection algorithm to 
stage k. Then we have 


ais at Aye Bap Xp ee Xz ay 


b is at Dae Pra Re YP 


Since i OO De and since vane the 
routing produces XpHYpeesXo=Yos Xy=¥z 5 we know 
the precise relationship of these calls already 


at the kth stage, i.e. $x_1 = \bar{y}_1$, $a_1 = \bar{b}_1$: the local
connection differs in the two least significant
bits.

Figure 15: A Full Communication Network With Localized Connections.


We add to the network switches to optionally
connect at each stage k, k = 1, 2, ..., n, input
Ly Z .Z, to ZZ 2. This exactly doubles 


n-1°°° J n-n-1"° | 
the number of switches in the network. The addi- 


tional switch type is shown in Figure 14. 

We now notice that at the nth stage, if a is 
to be mapped to b with a =b, then it will have 
been cut across by the n-I stage switches. Thus 


the mapping Xa, > oe y where y=0 or 1 is only 


necessary when a nay Thus no switch is necessary 
and we map X14, > X14, and we can eliminate the 
nth stage.


Finally, we give one last diagram using the
switch from Figure 14, represented there by a
compact symbol. Figure 15 is the
full processor communication network with
optimized local connections.


Conclusion 


We have used a graph model in computing the 
routing for permutation network, partitioning 
network, and full processor communication network. 
Special emphasis is placed on multistage full
interconnection networks with hierarchical structure.


A rearrangeable non-blocking interconnection net- 
work with local properties is developed for full 
processor communication. Such network provides 
shortest routing for both local and long distance 
connections. The complexity of the routing 
algorithm and the number of switches used are 
both in the order of N log N. It is also shown 
that by using parallel processors, the routing 
computation time can be reduced to O(N). In 
addition to the rearrangeable non-blocking inter- 
connection network, blocking and non-blocking 
models are reviewed. Interconnect networks play 
an important role in communication and parallel 
processor systems., Further research results in 
the application of the techniques used in this 
paper to.general Clos networks with full processor 
communications and local routing are expected. 
USE OF THE AUGMENTED DATA MANIPULATOR MULTISTAGE NETWORK FOR SIMD MACHINES


S. Diane Smith 
University of Wisconsin, Electrical and Computer Engineering Dept., Madison, WI 53706 


Howard Jay Siegel, Robert J. McMillen, George B. Adams III 
Purdue University, Electrical Engineering School, West Lafayette, IN 47907 


Abstract -- The capabilities of the augmented data
manipulator (ADM) and the inverse ADM (IADM) as
permutation networks for SIMD machines are explored.
Redundant control settings for commonly
used permutations are examined. A method to count
the number of distinct permutations performable by
these networks is given. Finally, techniques for
controlling these networks in SIMD mode are
presented.
I. INTRODUCTION 
In [13] it is shown that the multistage cube
networks called the generalized cube, omega,
indirect binary n-cube, and STARAN flip are
equivalent and that the capabilities of the augmented
data manipulator (ADM) network are a superset
of those of these multistage cube networks.
In this paper, the use of the ADM in an SIMD
environment is studied.

An SIMD (single instruction stream-multiple
data stream) machine has a control unit which
broadcasts instructions to N processors. A processor
along with its private memory is called a
processing element or PE. All active PEs execute
the same instruction at the same time, each processor
on data from its own memory. Data can be
transferred by the interconnection network from PE
to PE. Each PE is assigned a unique address from
0 to N-1, where $N=2^n$.
An interconnection network can be described as
a set of interconnection functions, where each
interconnection function is a permutation (bijection)
on the set of PE addresses [8]. When
interconnection function f is applied, input i is
connected to output f(i) for all i, $0 \le i < N$,
simultaneously. An equivalent definition is that the
interconnection network takes the set of PE
addresses as its input and produces as its output a
permutation of these PE addresses, i.e., it maps
an input address to an output address.

The Plus-Minus $2^i$ (PM2I) network consists of
the 2n functions defined by
$PM2_{+i}(j) = j+2^i \bmod N$ and $PM2_{-i}(j) = j-2^i \bmod N$
for $0 \le j < N$, $0 \le i < n$ [8], where (-x = N-x) mod N.
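
As a concrete illustration, the PM2I functions can be written directly in Python. The function names and argument order are choices made for this sketch; Python's `%` already returns a non-negative result, which matches the convention (-x = N-x) mod N.

```python
def pm2_plus(i, j, n):
    """PM2_{+i}(j) = j + 2^i mod N, with N = 2^n."""
    return (j + (1 << i)) % (1 << n)


def pm2_minus(i, j, n):
    """PM2_{-i}(j) = j - 2^i mod N, with N = 2^n."""
    return (j - (1 << i)) % (1 << n)


n = 3
for i in range(n):
    print(i, [pm2_plus(i, j, n) for j in range(1 << n)],
          [pm2_minus(i, j, n) for j in range(1 << n)])
```

Each function is a bijection on the addresses 0..N-1, and for i = n-1 the plus and minus functions coincide because $2^{n-1}$ and $-2^{n-1}$ are equal mod N.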

The data manipulator network [2], Fig. 1, consists
of n stages with N switching cells per
stage, plus a column of network output cells. The
stages are ordered from n-1 to 0, where the
interconnection functions of stage i are $PM2_{+i}$, $PM2_{-i}$,
and the identity (straight). There is one pair of
control signals per stage. At stage i, cells whose
i-th address bit is 0 respond to one control, the
other cells to the other control. The binary
representation of an address P is written
$P = p_{n-1} \ldots p_1 p_0$, $0 \le P < N$.

Figure 1: The data manipulator network for N=8.

This work was supported by the Air Force Office of
Scientific Research under AFOSR-78-3581. The U.S.
Government is authorized to reproduce and distribute
reprints for Government purposes notwithstanding
any copyright notation hereon.


The augmented data manipulator (ADM) is a data
manipulator with individual cell control
[9,11,13,14]. Each cell receives control signals
independently of any other cell.
If the stages of the ADM are traversed in reverse
order, i.e., the input stage is stage 0
($PM2_{\pm 0}$) and the output stage is stage n-1, the
resulting network is the inverse
ADM (IADM) [15].


II. CAPABILITIES OF THE ADM AND IADM
Lemma 1: The ADM passes the permutation f if and
only if the IADM passes $f^{-1}$.
Proof: See [15].

Theorem 1: The ADM can perform a perfect shuffle
in one pass through the network.
Proof: The perfect shuffle interconnection function
is shuffle($p_{n-1} \ldots p_1 p_0$) = $p_{n-2} \ldots p_1 p_0 p_{n-1}$.
The switch settings for
stage i, $n > i \ge 0$, are determined as follows, where
the address of a cell P at stage i is $p_{n-1} \ldots p_1 p_0$.

    set stage n-1 to straight across;
    for i = n-2 step -1 until 0 do
        if $p_{i+1} \ne p_i$
        then if $p_{i+1} = 0$
             then set cell P at stage i to $PM2_{+i}$
             else set cell P at stage i to $PM2_{-i}$
        else set cell P at stage i to straight across;

For the controls calculated from the algorithm,
data originally from PE $p_{n-1} \ldots p_1 p_0$ is sent to
cell $p_{n-2} p_{n-3} \ldots p_i p_{n-1} p_{i-1} \ldots p_0$ at stage i. This
algorithm is related to the "PM2I + shuffle" algorithm
in [10] and is proved correct in [15].
Corollary 1: The IADM can perform an inverse perfect
shuffle in one pass through the network.
Proof: Follows from Lemma 1 and Theorem 1.
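
The control algorithm of Theorem 1 can be simulated to see that it produces the shuffle without conflicts. The sketch below is an illustration under stated assumptions: it models only which cell each data item occupies after every stage, treats two items in the same cell as a conflict, and uses hypothetical function names.

```python
def shuffle(p, n):
    """Perfect shuffle: rotate the n-bit address p left by one position."""
    mask = (1 << n) - 1
    return ((p << 1) | (p >> (n - 1))) & mask


def adm_shuffle_route(n):
    """Route every PE through the ADM with the settings of Theorem 1.

    Returns a list whose entry `src` is the output cell reached by data from PE src.
    """
    size = 1 << n
    cell = list(range(size))  # stage n-1 is set straight across
    for i in range(n - 2, -1, -1):
        moved = []
        for c in cell:
            upper, lower = (c >> (i + 1)) & 1, (c >> i) & 1
            if upper != lower:
                step = (1 << i) if upper == 0 else -(1 << i)  # PM2_{+i} or PM2_{-i}
                c = (c + step) % size
            moved.append(c)
        if len(set(moved)) != size:
            raise ValueError(f"conflict at stage {i}")
        cell = moved
    return cell


print(adm_shuffle_route(3))
```

For N=8 the result equals the list of shuffle(p) for p = 0..7, and no stage raises a conflict.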

Theorem 2: The IADM cannot perform a perfect shuffle
in one pass through the network.
Proof: Assume arithmetic is mod N. Consider
$P = 0^{n-2}11$, where the superscript is a repetition
factor, e.g., $0^4 11$ = 000011. The difference of the
addresses P and shuffle(P) is an odd number.
Since no combination of $PM2_{+i}$ and $PM2_{-i}$, 0<i<n,
yields an odd number difference as the shuffle
does, data from P must use $PM2_{\pm 0}$ at stage 0. The
distance between P+1 and shuffle(P+1) is even, as
is the distance between P-1 and shuffle(P-1). The
straight connections are used for the data from
P+1 and P-1 at stage 0, creating a conflict.
Corollary 2: The ADM cannot perform an inverse
perfect shuffle in one pass through the network.
Proof: Follows from Lemma 1 and Theorem 2.
The generalized cube and its equivalents [15]
cannot perform the shuffle or inverse shuffle (for
N>16, $00p_{n-3} \ldots p_1 p_0$ and $10p_{n-3} \ldots p_1 p_0$ conflict at
stage n-1 for the shuffle, and $p_{n-1} \ldots p_2 01$ and
$p_{n-1} \ldots p_2 11$ conflict at stage 1 for the inverse
shuffle).
Theorem 3: A bit reversal function transfers data
from PE $P = p_{n-1} \ldots p_1 p_0$ to $P' = p_0 p_1 \ldots p_{n-1}$. For
N>8, the IADM cannot perform a bit reversal in one
pass through the network.
Proof: Let $P = 0^{n-2}11$. The distance between P and
its bit reversal is an odd number, so $PM2_{\pm 0}$ must
be used. The distance between P+1 and its bit
reversal is an even number, as is the distance
between P-1 and its bit reversal. The straight
connections are used for the data from P+1 and P-1
at stage 0, creating a conflict.

Corollary 3: For N>8, the ADM cannot perform a bit
reversal in one pass through the network.
Proof: Follows from Lemma 1 and Theorem 3.

For some transfers, more than one setting exists
for the ADM. In addition to being of
theoretical interest, the existence of redundant
paths adds a certain amount of fault tolerance.
Two classes of these redundant settings are shown
(details in [15]).

Theorem 4: There are n-i different control settings
for the ADM which realize the $Cube_i$
interconnection function, $0 \le i \le n-2$.
Proof: $Cube_i(p_{n-1} \ldots p_1 p_0) = p_{n-1} \ldots \bar{p}_i \ldots p_0$, $0 \le i < n$
[8]. $Cube_i$ can be realized by setting the ADM
controls such that at stage i, cells whose i-th
address bit equals 0 perform $PM2_{+i}$, while those
whose i-th address bit is 1 perform $PM2_{-i}$.
Since $2^i = 2^k - 2^{k-1} - \cdots - 2^i$, n>k>i, there are n-i
different settings for the ADM which accomplish
$Cube_i$. Data items from an arbitrary pair of inputs,
P and P', P<P', cannot conflict.
Case 1: $p_i = p_i'$. Always P'-P cells apart.
Case 2: $p_i \ne p_i'$. For stage j, j>i, the data items
must be in cells which differ in at least the i-th
bit position (since $p_i \ne p_i'$). At stage i, the data
from P will be at cell $Cube_i(P)$ and the data from
P' will be at $Cube_i(P')$, which will differ in at
least the i-th bit position.
The uniform shift permutations send data from PE P to P' = P+A mod N,
0≤A<N, for all PEs. Let A = a_{n-1}...a_1a_0.

Theorem 5: The ADM has redundant control settings for all uniform shifts
of A mod N, 0<A<N.

Proof: The ADM can be set as follows: at stage i, if a_i=0, then set the
network to straight across; if a_i=1, then set the network to PM2_{+i}.
Let A be expressed in signed digit notation, where a_i' ∈ {0, +1, -1},
as the sum and difference of powers of 2 (e.g., A = 0111 = 100(-1) =
10(-1)1 = 1(-1)11). The following are all equivalent [7]:

    a'_{n-1}...a'_k 0 1...1 1 0 a'_j...a'_0
      ...
    a'_{n-1}...a'_k 1 (-1) 1...1 0 a'_j...a'_0

Each of these different representations of A can be used to yield
control settings for the ADM network as follows: at stage i, if a_i'=0,
then set stage i to straight across; if a_i'=1, to PM2_{+i}; if
a_i'=-1, to PM2_{-i}. Since all cells in a stage are set the same way,
no conflicts can occur.

Corollary 4: Theorems 4 and 5 hold for the IADM.

Proof: Follows from Lemma 1, Cube_i^{-1} = Cube_i, and the inverse of
shift A mod N is shift N-A mod N.

One measure of a network is the number of permutations it can perform.
The generalized cube network (and its equivalents [13]) can perform
2^{Nn/2} permutations [12]. The following theorems consider the number
of permutations performable by the ADM (details in [1]).

Lemma 2: For N = 4, the ADM can perform all possible N! = 24
permutations.

Proof: By enumeration (see [15]). □

A size N ADM can be partitioned into two independent subnetworks of size
N/2 [11], plus stage 0. These subnets have the same structure as a size
N/2 ADM. All the inputs of one subnet are even-numbered (the even
subnet). The subnet with all the odd-numbered inputs is the odd subnet.
The connection of the two subnets to stage 0 of the size N ADM is shown
in Fig. 2. All even-numbered inputs of stage 0 are connected to the
outputs of the even subnet and all odd-numbered inputs to the outputs of
the odd subnet.

Let S_i, D_i specify a source/destination pair.

Figure 2: Partitioning the ADM network.
A connection in stage 0 that does not affect the low order bit, i.e.,
(s_0)_i = (d_0)_i, is a straight connection. A connection that changes
the low order bit, (s_0)_i ≠ (d_0)_i, is called an exchange. A regular
exchange is between stage 0 inputs P = p_{n-1}p_{n-2}...p_1 0 and P+1.
An irregular exchange is between stage 0 inputs P and P-1 mod N. Any
possible configuration of stage 0 that is a permutation, except the all
+1 or all -1 configurations, consists of straight and exchange
connections only [11] and can be expressed as an N-bit number. A bit is
associated with each adjacent pair of inputs, including the wrap-around
pairing of 0 and N-1. If the adjacent pair of inputs form an exchange,
the bit is 1; if not, 0 (see Fig. 3).

Two kinds of adjacency for binary numbers are distinguished. When the
first and last bits of the binary number (representing the wrap-around)
are not considered adjacent it is linear adjacency. When the first and
last bits are considered adjacent it is circular adjacency.

Lemma 3: Every configuration of stage 0, except the settings all +1 or
all -1, that is a permutation, has a unique associated binary number
with no circular adjacent bits that are 1.

Proof: If there are circular adjacent 1's, then an input P is in two
exchanges such that P → P+1 mod N and P → P-1 mod N.

Lemma 4: The number of N-bit numbers with no linear adjacent 1's is
B(N) = B(N-1) + B(N-2); B(2)=3, B(3)=5, N ≥ 4.

Proof: If the number ends in a 0, it must have no linear adjacent 1's in
the first N-1 bits. If it ends in a 1, the bit immediately preceding
must be a 0, and the first N-2 bits must have no linear adjacent 1's.

Lemma 5: For an N-input ADM network, the number of stage 0
configurations that yield a permutation of stage 0 inputs to outputs is
a(N) = B(N) - B(N-4) + 2, N ≥ 8.

Proof: By Lemma 3, a(N) is the number of N-bit numbers with no circular
adjacent 1's, plus all +1 and all -1. B(N) exceeds the number with no
circular adjacent 1's by the number with no linear adjacent 1's which do
have circular adjacent 1's. These numbers are of the form
10a_1a_2...a_{N-4}01, where a_1...a_{N-4} has no linear adjacent 1's. □

Figure 3: a) Straight connections b) Regular exchange c) Irregular
exchange. Also shown, the associated binary number (N = 8).

Lemma 6: Consider the stage 0 permutations except the all irregular
exchanges, all +1, and all -1. Any two of these permutations differ in
the source subnetwork for at least one output.

Proof: Consider two distinct permutations of the given set. There must
be at least one output D_i which is mapped differently. If output D_i is
connected to a straight stage 0 connection, (d_0)_i = (s_0)_i. If it is
connected to an exchange at stage 0, (d_0)_i ≠ (s_0)_i, and it receives
its data from a different source subnet.
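The counts in Lemmas 4 and 5 can be cross-checked by brute force for small N. This is an illustrative sketch only; the helper names are not from the paper, and the enumeration is practical only for small N.

```python
from itertools import product


def count_no_adjacent_ones(n_bits, circular):
    """Count n_bits-bit numbers with no two adjacent 1's.
    With circular=True the first and last bits also count as adjacent."""
    count = 0
    for bits in product((0, 1), repeat=n_bits):
        linear_ok = all(not (bits[k] and bits[k + 1]) for k in range(n_bits - 1))
        wrap_ok = not (circular and bits[0] and bits[-1])
        if linear_ok and wrap_ok:
            count += 1
    return count


def B(n_bits):
    """Recurrence of Lemma 4: B(N) = B(N-1) + B(N-2), B(2)=3, B(3)=5."""
    if n_bits == 2:
        return 3
    prev2, prev1 = 3, 5
    for _ in range(4, n_bits + 1):
        prev2, prev1 = prev1, prev1 + prev2
    return prev1


def a(n_inputs):
    """Lemma 5: stage 0 configurations that are permutations (n_inputs >= 6)."""
    return B(n_inputs) - B(n_inputs - 4) + 2


if __name__ == "__main__":
    print(B(8), count_no_adjacent_ones(8, circular=False))
    print(a(8), count_no_adjacent_ones(8, circular=True) + 2)
```

In both printed lines the closed form and the enumeration should agree; the "+ 2" accounts for the all +1 and all -1 settings, which have no associated binary number.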


Theorem 6: A lower bound on the number of distinct permutations
performable by the ADM, P(N), is
P(N) ≥ P(N/2)^2 * [a(N)-3]; P(4)=24; N ≥ 8.

Proof: Each subnet can perform P(N/2) permutations. Let stage 0 be
restricted to any permutation other than all +1, all -1, or all
irregular exchanges; there are a(N)-3 such configurations. By Lemma 6,
any change in the stage 0 setting will cause at least one output to be
mapped from a different subnet, changing the overall permutation. P(4)
is from Lemma 2.

Theorem 7: An upper bound on the number of distinct permutations
performable by the ADM is
P(N) < P(N/2)^2 * a(N); P(4)=24; N ≥ 8.

Proof: Assuming that the composition of any input permutation with any
stage 0 permutation yields a unique overall permutation gives the above
result. The inequality is strict because the assumption is false (there
are redundant settings).
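To get a feel for how far apart the two bounds are, the recurrences of Theorems 6 and 7 can be evaluated directly. This is a minimal sketch with illustrative function names; it applies each recurrence to its own bound, and a(N) is taken from Lemma 5.

```python
def B(n_bits):
    """Lemma 4: B(N) = B(N-1) + B(N-2), B(2)=3, B(3)=5."""
    if n_bits == 2:
        return 3
    prev2, prev1 = 3, 5
    for _ in range(4, n_bits + 1):
        prev2, prev1 = prev1, prev1 + prev2
    return prev1


def a(n_inputs):
    """Lemma 5: number of stage 0 configurations that are permutations."""
    return B(n_inputs) - B(n_inputs - 4) + 2


def lower_bound(n_inputs):
    """Theorem 6 recurrence, starting from P(4) = 24."""
    if n_inputs == 4:
        return 24
    return lower_bound(n_inputs // 2) ** 2 * (a(n_inputs) - 3)


def upper_bound(n_inputs):
    """Theorem 7 recurrence (a strict bound), starting from P(4) = 24."""
    if n_inputs == 4:
        return 24
    return upper_bound(n_inputs // 2) ** 2 * a(n_inputs)


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, lower_bound(n), upper_bound(n))
```

For N = 8 this gives 26496 and 28224, both below 8! = 40320.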


III. NETWORK CONTROL


Routing tags are used to distribute control of the network among the N
PEs. A full routing tag F = f_{2n-1}f_{2n-2}...f_1f_0 at each input can
specify any arbitrary path. In stage i, if f_i=0, the straight link is
used; if f_i=1 and f_{n+i}=0, the +2^i link is used; otherwise the -2^i
link is used. If all the sign bits in a full tag are the same, form an
n+1 bit routing tag by computing the signed magnitude difference between
destination D and source S: T = t_n t_{n-1}...t_1t_0 = D-S, where t_n=0
and t_n=1 indicate positive and negative, respectively, and
t_{n-1}...t_1t_0 equals the absolute value of D-S [5]. At stage i, if
t_i=0, the straight connection is used; if t_n=0 and t_i=1, the +2^i
link is used; otherwise the -2^i link is used. If all N tags for a
permutation are calculated in this way, then the permutation is routed
using natural routing tags. An individual route consisting of only
straight or +2^i-type connections is positive dominant; an individual
route consisting of only straight or -2^i-type connections is negative
dominant [5].


Two tags are equivalent if they route a message from the same source to
the same destination.

Theorem 8: Let A' denote the two's complement of A and T≠0 (S≠D). Then
T' is equivalent to T.

Proof: See [5]. □
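Theorem 8 can be illustrated with a small simulation. The sketch below is not taken from [5]; it assumes the n+1 bit tag is stored as an integer with the sign in bit n and the magnitude in bits n-1..0, and that the two's complement is taken over all n+1 bits. The function names are hypothetical.

```python
def natural_tag(source, dest, n):
    """Signed-magnitude tag T = D - S: sign in bit n, magnitude in bits n-1..0."""
    diff = dest - source
    sign = 1 if diff < 0 else 0
    return (sign << n) | abs(diff)


def twos_complement(tag, n):
    """Two's complement of an (n+1)-bit tag."""
    return (-tag) & ((1 << (n + 1)) - 1)


def route(source, tag, n):
    """Apply the tag stage by stage: bit i of the magnitude selects a 2^i link,
    and the sign bit selects + or -. Returns the cell reached."""
    size = 1 << n
    negative = (tag >> n) & 1
    cell = source
    for i in range(n):
        if (tag >> i) & 1:
            step = 1 << i
            cell = (cell - step if negative else cell + step) % size
    return cell


if __name__ == "__main__":
    n = 3
    t = natural_tag(5, 2, n)        # D - S = -3, a negative dominant tag
    t_prime = twos_complement(t, n)  # equivalent positive dominant tag (+5)
    print(route(5, t, n), route(5, t_prime, n))
```

Both routes end at destination 2: the tag -3 and its two's complement +5 differ by N = 8, so they are equivalent in the sense defined above.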


A permutation is routed using positive dominant routing tags if those
tags that are negative dominant in the set of natural routing tags are
converted to positive dominant using Theorem 8. Lenfant has defined five
families of frequently used permutations [4]. Theorems 9 to 12 show that
two of the families are passable by both the ADM and IADM using positive
dominant tags. The proofs are very briefly sketched and the details are
in [6]. Let (a_j, a_{n-j}) be the bitwise representation of an address
P, a_j the j high order bits, and a_{n-j} the n-j low order bits.
<T>_i denotes T mod 2^i.

Lemma 7: The location of a message in stage i of the IADM is cell
<S + <T>_i>_n, where T is the magnitude portion of its positive dominant
tag.

Proof: At stage i, bits 0 to i-1 have been examined, so the message has
been displaced by <T>_i. □
Theorem 9: The class of permutations λ^{(n)}_{j,k}, which maps X to
jX+k mod N (j odd), is passable by the IADM using positive dominant
tags.

Proof: Lemma 7 is used to show no conflicts can occur. □

Theorem 10: The class of permutations λ^{(n)}_{j,k}, which maps X to
jX+k mod N (j odd), is passable by the ADM in one pass using positive
dominant tags.

Proof: Lemma 1, Theorem 9, and properties of the ring of integers mod N
[3] are used to show that the inverse of each permutation in the class
is also in the class, and the class is passable.

Theorem 11: The class of permutations δ^{(n)}_{j,k}, which maps
(a_j, a_{n-j}) to (a_j, <k + a_{n-j}>_{n-j}) (j<n), is passable by the
IADM using positive dominant tags.

Proof: It is shown that if a_{n-j} < 2^{n-j}-k, the tag is
00...0k_{n-j-1}...k_1k_0; otherwise it is 11...1k_{n-j-1}...k_1k_0. This
is used to demonstrate no conflicts can occur.

Theorem 12: The class δ^{(n)}_{j,k} (j<n) is passable by the ADM using
positive dominant tags.

Proof: Lemma 1 and Theorem 11 are used to show that the inverse of each
permutation in the class is also in the class, and the class is
passable. □

Positive dominant routing tags cannot be used to route all passable
permutations without conflict (e.g. perfect shuffle).

Theorem 13: The perfect shuffle is passable by the ADM network using
natural routing tags.

Proof: If p_{n-1}=1, T = (shuffle(P)-P)<0, i.e. T is negative dominant.
In Theorem 1, if p_{n-1}=1, the bit pair p_{i+1}p_i will always be of
the form 10 or 11. The algorithm specifies settings of -2^i and straight
respectively, representable by a negative dominant tag. The case for
p_{n-1}=0 is similar.

Corollary 5: The inverse shuffle permutation is passable by the IADM
using natural routing tags.

Proof: Follows from Lemma 1 and Theorem 13.
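The way Lemma 7 is used in Theorem 9 can be made concrete with a short simulation. The sketch below rests on two assumptions that are not spelled out above: the IADM examines tag bits in the order 0, 1, ..., n-1, and a conflict means two messages occupying the same cell at the same stage. All names are illustrative.

```python
def positive_dominant_magnitude(source, dest, n):
    """Magnitude of the positive dominant tag: (D - S) mod N."""
    return (dest - source) % (1 << n)


def iadm_positions(source, magnitude, n):
    """Cells occupied at stages 0..n (index n is the network output).
    By Lemma 7 the cell at stage i is (S + (T mod 2^i)) mod N."""
    size = 1 << n
    positions = [source]
    cell = source
    for i in range(n):
        if (magnitude >> i) & 1:
            cell = (cell + (1 << i)) % size
        positions.append(cell)
    return positions


def conflict_free(perm, n):
    """True if no two messages share a cell at any stage."""
    size = 1 << n
    routes = [
        iadm_positions(s, positive_dominant_magnitude(s, perm(s), n), n)
        for s in range(size)
    ]
    return all(len({r[stage] for r in routes}) == size for stage in range(n + 1))


if __name__ == "__main__":
    n, size = 4, 16
    ok = all(
        conflict_free(lambda x, j=j, k=k: (j * x + k) % size, n)
        for j in range(1, size, 2)
        for k in range(size)
    )
    print("all X -> jX+k (j odd) conflict free for N = 16:", ok)
```

Under these assumptions every permutation X to jX+k mod 16 with j odd passes without conflict, in line with Theorem 9.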

The tags used in Theorems 9 to 13 require only 
n+1 bits and are easy to compute. If a passable 
permutation is needed, but cannot be represented 
with natural or positive dominant tags, full rout- 
ing tags can be precomputed. 


IV. CONCLUSIONS 
The use of the ADM and IADM networks for SIMD processing has been
explored. Analyses such as these are necessary in order to evaluate the
cost-effectiveness of the ADM (and IADM) as SIMD interconnection
networks.
Summary
A connection network is described for 
connecting any or all of a large number of 
processors on one side to a large number of 
memory modules on the _ other. Each processor 
independently requests connection through the 
network. Response time is to be commensurate 


with the access time of memory, and hence no time 
can be allowed for global control of the network. 
The connection is made at combinatorial logic
speed, and the connection held for accessing one 
word only. In the specific case studied, a 
network to be embedded in the Flow Model 
Processor of the Numerical Aerodynamic Simulator, 
there were 512 processors and 521 memory modules, 
with an assumed memory access time of 240 ns. 



The selected network (the "baseline" network of [3]) is isomorphic to
the Omega network of [4]. Figure 1 shows an example of this type of
network together with an example of how the individual bits of the
requested memory module number control the connection being made through
each two-by-two node. To avoid the hazards of designing with arbiters
and synchronizers, the connection network is synchronized by a clock,
whose cycle time exceeds the roundtrip delay through the net, but may be
substantially shorter than the memory access time. The bidirectional
path through the network is latched up with the acknowledge bit from the
memory module while addresses, memory commands, and data are
transmitted. A path width of 11 bits was chosen. This width is wide
enough to allow the module number and a strobe through the network in
parallel and provides sufficient bandwidth for the balance of the
system.


The entire collection of processors will run no faster than the slowest
processor due to points of synchronization within the programs being
executed. An important constraint on the network is that it treat
processor requests fairly since a slow processor will slow the whole
system; in the applications studied, all processors had equal amounts of
computation to do. Thus on the average no node in the network may favor
one input port over another. For redundancy and additional bandwidth the
CN is assumed to be duplexed.

(a) The information in this paper was previously submitted to NASA Ames
as the final report of contract NAS2-9897.

The performance of the Connection Network was evaluated by simulation.
The simulators collected data on the effect of blockage in the network
on processor throughput, particularly the effect on the last processor
to finish. The simulator generated its own test cases. Table 1 shows the
result when the simulator is presented with a single access from each
processor, and then runs to completion. The test cases included the
three data access patterns that dominated the aerodynamic flow programs
furnished by NASA, as well as the case where the 512 processors request
access to memory modules which have been selected at random. Figure 2
shows the result for one case in which each processor has a number of
random memory access requests. The three curves in Figure 2 are R (the
number of processors making a request on this memory cycle), M (the
number of different memory modules represented in these requests; some
processors request connection to the same memory module, and thus
conflict with each other), and Z (the number of processors which the
connection network succeeds in connecting to some memory module). Z/M is
the fraction of successes versus the maximum possible number of
successes. In all these simulations, the network was duplexed.

Simulations reported by Harris and Zichterman [5], and reproduced here
by permission, are shown in Fig. 3 and 4. In this case, the processor
queues were filled by requests (and the associated timings) generated by
a simulator of the FMP processor. In this way the test case had
realistic timings. Fig. 3 shows six accesses during the first iteration
of a particular segment of code. Fig. 4 shows the fourth iteration of
these same six access patterns. The spread of access times represents
the processors getting slightly out of synchronism with each other as
some get delayed slightly.

The network, with NlogN complexity, has been validated for application
to a many-processor multiprocessor with success not only for the access
patterns exhibited in the targeted aerodynamic flow code applications,
but also for random patterns of accessing.
Abstract


In this paper we describe a memory system 
designed for parallel array access, and first 
used in the Burroughs Scientific Processor. The 
system is based on the use of a prime number of 
memories to allow conflict free access, and a 
powerful combination of indexing hardware and 
data alignment switches. The use of a prime number
of memories causes certain difficulties in
addressing hardware, and particular emphasis is 
placed on the memory indexing equations and 
their implementation. 

1. Introduction 

The problem discussed in this paper is the 
design of a memory system that can access, in 
parallel, the required sections of an array, 
e.g., a row, column, diagonal, etc. A number of 
these memory systems have been discussed in the 
literature. In [Batc77], Batcher discusses a 
scheme for allowing access to words, bit slices, 
or "byte" slices of a two-dimensional bit array. 
Feng described another scheme for accessing
various "slices" of data in [Feng74]. Other work
described in [Ston71], [Swan74], [Lang76],
[LaSt76], [Orcu76], [Sieg77], [Lawr75], and 
[Shap75] has treated the problem from a variety 
of viewpoints. However, all of these designs 
have restrictions either on the kinds of 
"slices" available without memory access con- 
flicts or in the data alignment capabilities. 

In [BuKu71], Budnik and Kuck observed that 
if the number of memory modules is a prime 
number, then access to any "linear" array 
slices can be achieved without conflict (provided 
that the memory ordering of the desired array elements is relatively
prime to the number of array elements). This observation turns out to be
quite useful. However, the problem of addressing this type of memory
turns out to be difficult due to the need to do integer divisions and
modulo operations in the addressing hardware. In this paper we will
discuss these problems in more detail, and will present a feasible
implementation of the prime memory system.

Since many of the ideas in this paper have 
been incorporated in the design of the Burroughs 
Scientific Processor (BSP), we will describe some 
of the details of the memory, alignment, and 
indexing hardware of this machine. The BSP is a 
high performance computer designed to be espec- 
ially effective on vector processing applications, 
without significantly impairing its performance on 
scalar computations. As can be seen in Figure 1, 
the BSP consists of sixteen processing units, 
seventeen memories, two alignment networks, and a 
central control and scalar processing unit. The
control unit includes a fully functional scalar 
processing unit which can be overlapped with 
vector operations, and additional memory for 
scalar data and program storage. (See [KuSt79] 
for further details.) Special hardware is 
included in the control unit to perform vector 
addressing and alignment control, and these 
operations can be overlapped with vector and 
scalar processing. We refer to the alignment, 
indexing, and memory systems collectively as the 
AIM system. We will discuss this system in more 
detail in Section 3. 

The alignment networks shown in Figure 1 are
in reality crossbar switches controlled by source 
tags. (That is, each output port of the network 
can supply a "tag" which specifies the number of 
the input from which it needs data.) While in 
general, crossbar switches are too expensive for 
large arrays of processors, due to the relatively 
small number of processors in the BSP it was 
determined that crossbar switches were the most 
cost-effective form of switch capable of perform- 
ing all the desired alignments. 

In particular, functions such as compress, expand, and merge require a
random aligning pattern which only the crossbar switch could perform
efficiently in the allocated time. Other
forms of switches were investigated, e.g., the 
Swanson network [Swan74], Omega switch [Lawr75], 
Barrel Shift network, etc., but these switches 
do not perform all the functions needed in the 
2. The Storage Scheme and Associated Equations


By a storage scheme, we mean the set of rules 
which determine the module number and address 
within that module where a given array element is 
stored. For the present, we will restrict our 
attention to two-dimensional arrays. However,
generalization of these storage schemes is
trivial for higher dimensioned arrays.

Figure 2 shows an 8x8 array stored in 5 
memory modules using one storage scheme. Notice 
that any 5 consecutive elements of a row, column, 
diagonal, etc., all lie in separate modules, and
thus can be accessed in parallel, i.e., without 
conflict. For example, the second through sixth 
elements of the first row are stored in module 
numbers 3, 1, 4, 2, 0, and at addresses 2, 4, 6, 
8, 10, respectively. 

Figure 2. Example of an 8X8 Array Stored in 5 Memory Modules

We begin with some definitions. Let M be the number of memory modules
and P be the number of processors, where we assume P < M and M is prime.
There are two storage equations, f(i,j) and g(i,j), which determine the
module number and address, respectively, of element (i,j) of the array.
In our case, we have the following equations:

    f(i,j) = [j * I + i + base] mod M        (1)
    g(i,j) = [j * I + i + base]/P            (2)

where we assume the array is dimensioned (I,J); "base" is the base
address of the array, and P is the greatest power of two less than M.
Notice that these equations require a MOD M operation where M is a prime
number. They also require an integer divide by P operation. However, P
is a power of two which makes this divide easily implementable. This
simplification is made possible by the "holes" shown in Figure 2.
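Equations (1) and (2) translate directly into code. The sketch below uses illustrative function names and reproduces the 8x8, M = 5, P = 4 example with base = 0; the division in equation (2) is taken to be integer division.

```python
def module_number(i, j, I, base, M):
    """Equation (1): memory module holding array element (i, j)."""
    return (j * I + i + base) % M


def module_address(i, j, I, base, P):
    """Equation (2): address of element (i, j) within its module."""
    return (j * I + i + base) // P


if __name__ == "__main__":
    I, base, M, P = 8, 0, 5, 4
    # Second through sixth elements of the first row: A(0,1) ... A(0,5).
    row = [(0, j) for j in range(1, 6)]
    print([module_number(i, j, I, base, M) for i, j in row])   # modules
    print([module_address(i, j, I, base, P) for i, j in row])  # addresses
```

The two printed lists are the module numbers 3, 1, 4, 2, 0 and addresses 2, 4, 6, 8, 10 quoted earlier for the first row.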

Clearly, the number of holes in each row of 
the memory is equal to M - P in general. For 
example, if M = 37 and P = 32, then 5/37th of the 
memory is wasted. These holes could be filled 
with other data, e.g., scalar data, but a cleaner 
solution is available at the expense of an 
increase in the complexity of the indexing equa- 
tions (see [LaVo79]). 

Next we define a linear N-vector, or simply 
an N-vector, to be an N element set of the ele- 
ments of the array formed by linear subscript 
equations: 


    V(a,b,c,e) = {A(i,j): i = ax+b, j = cx+e, 0 ≤ x < N}        (3)

where again we assume the array is dimensioned A(I,J). Thus, if
a = b = 0 and c = e = 1, then the N-vector (N = 5) is the second through
sixth elements of the first row of A: A(0,1), A(0,2), ..., A(0,5). If
a = c = 2 and b = e = 0, then the N-vector (N = 4) is every other
element of the main diagonal of A: A(0,0), A(2,2), ..., A(6,6). Notice
that the elements of the N-vector are ordered with index x.

Next we define the index equations for the N-vector V. We define α(x) to
be the address, in module μ(x), of the x-th element of the N-vector.
Thus combining equations (1) through (3) above, we get:

    μ(x) = f(ax + b, cx + e)
         = [(cx + e) * I + (ax + b) + base] mod M
         = [dx + B] mod M                                  (4)

where d = a + cI and B = b + eI + base. We define d to be the order of
the N-vector, and B to be the initial address. Next we get:

    α(x) = g(ax + b, cx + e)
         = [(cx + e) * I + (ax + b) + base]/P              (5)

It is easy to show that if d is relatively prime to the number of memory
modules, then access to the N-vector can be made without memory
conflict. (See [BuKu71] and [Lawr75] for a proof.)

Since it is most convenient to be able to generate the address α(x) in
memory μ(x), we solve for x in terms of μ and get:

    x(μ) = [(μ - B)d'] mod M                               (6)

where d' is the multiplicative inverse of d modulo M. Substituting this
into equation (5), we get:

    α(μ) = {(a + Ic)[(μ - B)d' mod M] + b + eI + base}/P
         = {d[(μ - B)d' mod M] + B}/P                      (7)

For example, consider the 5-vector V(0,0,1,1), i.e., the second through
sixth elements of the first row of A(8x8). We have B = 8 and d = 8, thus

    μ(x) = [(x + 1) * 8 + 0] mod 5,
    α(x) = [(x + 1) * 8 + 0]/4,

and since d' = 2 (i.e., 2 * 8 = 1 mod 5), we get:

    α(μ) = {8[2(μ - B) mod M] + 8}/4.

Thus,

    μ(x) = (3, 1, 4, 2, 0),
    α(x) = (2, 4, 6, 8, 10),

and

    α(μ) = (10, 4, 8, 2, 6).

Notice that the proper addresses in memories 0, 1, ..., 4, are 10, 4, 8,
2, 6, respectively. We use the μ(x) equation in the x-th processor to
determine the module number of the memory containing the x-th element of
the desired N-vector. At the same time, addressing hardware in memory μ
uses the α(μ) equation to determine the necessary address of the desired
element. We use α(μ) instead of α(x) because this eliminates the need to
route the addresses from the processors through the switch.
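The worked example can be reproduced with a few lines of code. This is a sketch of equations (4) through (7) with illustrative function names; it relies on Python's three-argument pow (version 3.8 or later) to obtain the multiplicative inverse d'.

```python
def order_and_initial_address(a, b, c, e, I, base):
    """d = a + cI and B = b + eI + base for the N-vector V(a,b,c,e)."""
    return a + c * I, b + e * I + base


def mu_of_x(x, d, B, M):
    """Equation (4): module holding the x-th element."""
    return (d * x + B) % M


def alpha_of_x(x, d, B, P):
    """Equation (5): address of the x-th element."""
    return (d * x + B) // P


def x_of_mu(mu, d, B, M):
    """Equation (6): which element of the vector lives in module mu."""
    d_inv = pow(d % M, -1, M)  # multiplicative inverse of d modulo M
    return ((mu - B) * d_inv) % M


def alpha_of_mu(mu, d, B, M, P):
    """Equation (7): address computed locally by module mu."""
    return (d * x_of_mu(mu, d, B, M) + B) // P


if __name__ == "__main__":
    M, P = 5, 4
    d, B = order_and_initial_address(0, 0, 1, 1, I=8, base=0)
    print([mu_of_x(x, d, B, M) for x in range(5)])
    print([alpha_of_x(x, d, B, P) for x in range(5)])
    print([alpha_of_mu(mu, d, B, M, P) for mu in range(5)])
```

The three lists match μ(x), α(x), and α(μ) in the example: each memory obtains its own address without any address being routed through the switch.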

This process is reasonably straightforward, 
except that it is not obvious that the hardware 
can do the necessary calculations efficiently. 

In Section 3, we will describe how we partition 
the equations into parts that can be done 
separately by special hardware in the CU, AU, 


and memory addressing box. 


3. Indexing Hardware 


Vector instructions in the BSP are designed to allow processing on
vectors of arbitrary length. The control unit automatically sequences
vector operations as a series of superword operations where a superword
consists of 16 or fewer vector elements. For example, a vector
instruction which specifies a vector of length 53 would be sequenced as
three superwords of 16 elements, followed by a superword of 5 elements.
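The sequencing rule amounts to a division with remainder. A minimal sketch follows (the function name is illustrative and this is not a description of the control unit hardware):

```python
def superword_lengths(vector_length, width=16):
    """Split a vector into full superwords of `width` elements
    plus one shorter trailing superword if anything is left over."""
    full, rest = divmod(vector_length, width)
    return [width] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(superword_lengths(53))
```

For a vector of length 53 this yields three superwords of 16 elements followed by one of 5.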

Associated with every array is an array 
descriptor (AD), shown in Figure 3(a). The two 
values in the AD describe the base address and 
total volume (words) of the array, and are used 
for addressing and bounds checks on the array. 
Every vector instruction refers to at least one 
and as many as six vector operands. Each vector 
operand is referenced through a vector set des- 
criptor (VSD), shown in Figure 3(b). The VSD 
actually describes a set of vectors from a given 
array. B is the address of the first element of
the first vector in the set. This vector is 
ordered with distance d, and contains L elements. 
The first element of the second vector in the set 
is the (signed) distance D from the first element 
of the first vector. There are K vectors in the 
set. Thus the VSD describes a two-dimensional 
set of data. 


Figure 3(a). Array Descriptor 
Figure 3(b). Vector Set Descriptor (VSD)

For example, the VSD (B = 1, d = 8, L = 8, D = 2, K = 4) describes the
odd numbered rows of the array A(8,8) shown in Figure 2. Similarly, VSD
(B = 0, d = 1, L = 8, D = 16, K = 4) describes even numbered columns,
and VSD (B = 0, d = 0, L = 8, D = 1, K = 8) describes a two-dimensional
set of data, X(i,j), where X(i,j) = A(i,0), 0 ≤ i, j < 8, and A(i,j) is
the array shown in Figure 2. The above parameters are not all stored
together. The first step in preparing a vector instruction is to compute
the above parameters, together with other values needed for addressing
and alignment. This is greatly facilitated by special-purpose indexing
hardware.
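A VSD can be expanded into the linear addresses it denotes. The sketch below is illustrative only: it assumes element A(i,j) sits at linear address j*8 + i with base 0, as in equation (1), and the function name is hypothetical.

```python
def vsd_addresses(B, d, L, D, K):
    """Linear addresses described by a vector set descriptor:
    K vectors, the v-th starting at B + v*D, each with L elements spaced d apart."""
    return [[B + v * D + x * d for x in range(L)] for v in range(K)]


if __name__ == "__main__":
    odd_rows = vsd_addresses(B=1, d=8, L=8, D=2, K=4)
    even_columns = vsd_addresses(B=0, d=1, L=8, D=16, K=4)
    broadcast = vsd_addresses(B=0, d=0, L=8, D=1, K=8)
    print(odd_rows[0])      # row 1: A(1,0), A(1,1), ..., A(1,7)
    print(even_columns[1])  # column 2: A(0,2), A(1,2), ..., A(7,2)
    print(broadcast[3])     # eight copies of the address of A(3,0)
```

With d = 0 every vector repeats a single element, which is how the third descriptor produces a two-dimensional set from one column.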

The purpose of the indexing hardware is to generate alignment tags and
memory addresses for vector access. Consider first the input alignment
network. To access a superword, processor p must generate an input
alignment tag, IAT, which specifies the memory module number of the p-th
element of the superword, i.e., μ(p). At the same time, the address of
the p-th element, α(p), is generated in memory μ(p). Notice that each
processor could generate the required address using equation (5), and
then route this address to the proper memory through the output
alignment network. However, by using equation (7), we avoid the extra
routing operation.

The output alignment network works similarly. Memory μ(p) is to receive
the p-th element of a superword, and thus generates an output alignment
tag, OAT, whose value is computed from equation (6) above. Each memory
also computes the required address, α(μ), for storing the output.

The alignment, indexing, and memory systems 
are responsible for a number of other functions. 
We will discuss these functions in a later sec- 
tion. For now, we will restrict our attention 
to accessing linear N-vectors. 


3.1 Linear N-Vector Access 


Let us assume for the moment that we are interested in access to a
single superword, with initial base address B, and with order d. If the
superword is to be fetched from the memory, then for each memory μ, we
must generate an address (see equations (4) through (6))

    α(μ) = {B + p(μ) * d}/P                                (8)

    where p(μ) = (μ - B) d' mod M                          (9)

and for each processor p, we must generate an IAT

    μ(p) = (B + d * p) mod M                               (10)

However, if the superword is to be stored in the memory, then for each
memory μ, we must generate an address given by equations (8) and (9) and
for each memory μ we must also generate an OAT

    p(μ) = [(μ - B) d'] mod M                              (11)

Thus M addresses and P IAT's or M OAT's are required to access a
superword. In the next section we will show how the generation of these
values can be simplified.


3.1.1 Recursive Generation Technique 


Consider the equation (10). Substituting (p + k) and (p - k) for p, we
get

    μ(p ± k) = [B + d * (p ± k)] mod M
             = [μ(p ± k ∓ 1) ± d] mod M                    (12)

Equation (12) implies that μ(p ± k) can be generated from μ(p) with
modulo M addition/subtraction operations instead of a multiply followed
by a modulo M addition. Extending the notion, from any μ(p) all tags can
be generated recursively with appropriate modulo M additions or
subtractions. In practice, primary μ(p) for several values of p are
generated using equation (10), and secondary μ(p) for the remaining
values of p are generated using equation (12). The number of primary
μ(p) versus the number of secondary μ(p) calculated can be determined by
a simple hardware versus time tradeoff.

The same technique can be applied to generate output alignment tags and
memory addresses. The equation for the OAT's is:

    p(μ ± k) = [p(μ ± k ∓ 1) ± d'] mod M                   (13)

For memory addresses, the equation is:

    α(μ ± k) = (B + {[p(μ ± k ∓ 1) ± d'] mod M} * d)/P     (14)

3.1.2 BSP Implementations 


For the BSP, P = 16 and M = 17. The base address, B, is a 23-bit value.
Element displacement, d, is a 23-bit signed quantity. For timing and
hardware considerations, 4 initial memory addresses, 4 IAT's and 4 OAT's
are generated by using multiplications and modulo and normal additions.
Other addresses and tags are generated by using binary adders. To use
the binary adders, the equations described in the previous section were
further simplified as follows. Let δ = d mod M, and notice that
μ(p) < M. For IAT's, we get

    μ(p ± k) = μ(p) ± kδ - cM                              (15)

    where cM ≤ μ(p) ± kδ < (c + 1)M

For example, assume M = 17. We might generate primary μ(p) for p = 1, 4,
7, 10, 13, 16 from equation (10). Secondary μ(p) for the remaining
values of p would be generated as follows from equation (15).

    μ(p + 1) = μ(p) + δ corrected by -17 if μ(p) + δ ≥ 17
    μ(p - 1) = μ(p) - δ corrected by +17 if μ(p) - δ < 0
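The primary/secondary split in this example can be mimicked in software to confirm that one add and one conditional correction reproduce equation (10). This is only a sketch: the names are illustrative, and the choice of primaries is the hypothetical one from the example, not a statement about the BSP hardware.

```python
M = 17  # number of memory modules, as in the example


def iat_direct(p, B, d):
    """Equation (10): input alignment tag computed with a multiply."""
    return (B + d * p) % M


def iat_table(B, d, primaries=(1, 4, 7, 10, 13, 16)):
    """Primary tags from equation (10); neighbours p+1 and p-1 from
    equation (15) with k = 1, using an add and a single +/-17 correction."""
    delta = d % M
    tags = {}
    for p in primaries:
        mu = iat_direct(p, B, d)
        tags[p] = mu
        up = mu + delta
        tags[p + 1] = up - M if up >= M else up
        down = mu - delta
        tags[p - 1] = down + M if down < 0 else down
    return tags


if __name__ == "__main__":
    tags = iat_table(B=1000, d=37)
    print([tags[p] for p in range(1, 17)])
```

Because δ and μ(p) are both less than 17, a single correction always suffices; that is what lets a plain binary adder stand in for the modulo operation.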


Equations for OAT's are the same as above except δ is replaced by d'.
For memory address generation, the equations are as follows. Let
A(μ) = B + p(μ) * d so that α(μ) = A(μ)/P. Then

    A(μ ± k) = B + d * p(μ ± k)
             = B + d([d'(μ ± k - B)] mod M)
             = B + d([p(μ) ± d'k] mod M)
             = A(μ) ± kdd' - dcM                           (16)

    where cM ≤ p(μ) ± kd' < (c + 1)M


Figure 4.

Address generation in the BSP is performed as follows. Primary A(μ) are
generated for μ = 2, 6, 11, and 15 as shown in Figure 4. Then secondary
values are generated for k = ±1, ±2. Notice that A(μ) for μ = 4 and 13
are each generated twice. This redundancy is used to check the hardware
integrity by comparing duplicated values. (In addition, modulo 3 checks
are performed on all additions to further verify hardware integrity.)

A primary A(μ) generator is shown in Figure 5. B mod M and d' are each
5-bit quantities (since M = 17) and are supplied by the Central Index
Unit (to be discussed in the next section). The quantity d'(μ - B) mod M
is supplied by a 1024x5 bit ROM. (The ROM contents differ for each
primary μ.)

Figure 5. A Primary Address Generator

A secondary address generator for A(μ + 2) is shown in Figure 6. Notice
that in equation (16) a test is required to determine the quantity added
to (or subtracted from) A(μ). This test depends on the quantity
p(μ) + 2d' = d'(μ - B) mod M + 2d' (see equation (16)). d'(μ - B) mod M
(available from the primary address generator) and d' are each 5-bit
quantities and are used as address inputs to a 1024x5 ROM. The output of
the ROM determines the test result and is used to multiply the necessary
additive factor for the final adder. The other secondary address
generators for A(μ + 1), A(μ - 1), and A(μ - 2) are similar to the one
shown in Figure 6. However, the A(μ ± 1) generators only need 2-way
multiplexers and a one-bit wide decision ROM. (Through further
simplification, these decision ROM's can be reduced to 512 words, so
that the total decision ROM for four secondary generators is just 6x512
bits.) The primary and its four associated secondary address generators
are all grouped together physically.


[Figure 6 diagram: inputs A(μ) = B + d·[d'(μ - B) mod M] and d'(μ - B) mod M; a 1024x2 ROM producing the decision bits; a final addition producing A(μ + 2).] 

Figure 6. Secondary Address Generation for A(μ + 2). 


Generation of IAT's and OAT's is essentially 
the same as, or simpler than, address generation. Only 
the values and number of bits change. One group 
of hardware, described above, generates the 
addresses, and a similar group of hardware gener- 
ates both the IAT's and OAT's. Both groups of 
hardware form part of the Central Index Unit that 
will be described next. 


3.1.3 The Central Index Unit 


One of the components of the control unit is 
the Central Index Unit (CIU). The purpose of the 
CIU is to a) perform automatic indexing of multi- 
ple superwords; b) generate input and output 
alignment tags; and c) generate 4 initial memory 
addresses and indexing constants. The CIU can 
be divided into 4 major sections: 1. Descriptor 
Store Unit; 2. Descriptor Processing Unit; 
3. IAT and OAT generators; and 4. Memory Address 
and Indexing Constant Generators. Figure 7 shows 
the organization of the above sections. The IAT, 
OAT and address generators were described in the 
previous section. The descriptor store unit 
stores up to 16 vector set descriptors (VSD). A 
simplified descriptor's contents are shown in 
Figure 3(b). 
Figure 7. Central Index Unit 

A superword access requires an Indexing Event 
in the CIU. During this event the descriptor is 
updated by the Descriptor Processing Unit to 
reflect the access. The processing depends upon 
the kind of descriptor as well as the data values 
within the descriptor. For example, suppose we 
have a two-dimensional vector set operand (e.g., 
K > 1). The processing will be as follows: 

If the length L is longer than a superword 
(N), then the descriptor values are updated as 
follows. These updates are performed after each 
superword access is initiated. 

$$\begin{aligned}
b &\leftarrow b + d \cdot N \\
B &\leftarrow B \\
L &\leftarrow L - N \\
K &\leftarrow K
\end{aligned}$$

However, if the length L of the last access 
was equal to or less than a superword (N), then the 
next superword should come from the next vector 
in the vector set. The appropriate update equations 
are as follows. 

$$\begin{aligned}
b &\leftarrow B + D \\
B &\leftarrow B + D \\
L &\leftarrow LL \\
K &\leftarrow K - 1
\end{aligned}$$


These actions cause the length to be reset 
to the initial length (LL), and increment the base 
address (B) to the base of the next vector in the 
set. 
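A minimal Python sketch of such a sequence of indexing events follows. It is an illustration, not the CIU logic: the generator name is hypothetical, and it assumes that b is the address of the current superword, B the base of the current vector, d the element displacement, D the displacement between vectors, LL the initial length, and K a count of the vectors still to be processed that is decremented on each move to the next vector.

```python
def superword_accesses(B, d, D, LL, K, N=16):
    """Yield (start_address, length) for each superword access of a vector set."""
    b, L = B, LL
    while K >= 1:
        yield b, min(L, N)
        if L > N:
            # More of the current vector remains: step along it by one superword.
            b += d * N
            L -= N
        else:
            # Current vector finished: reset the length, move to the next vector.
            B += D
            b = B
            L = LL
            K -= 1

# Two vectors of 20 elements each, element displacement 2, vector displacement 1000.
accesses = list(superword_accesses(B=100, d=2, D=1000, LL=20, K=2))
```

Each vector of 20 elements is covered by one full superword of 16 elements and one short superword of 4 elements before the base address advances by D.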


3.2 Other Functions of the Alignment, Indexing, 
and Memory System 


As we mentioned above, the AIM system is also 
responsible for a number of other functions. In 
order to facilitate the smooth flow of data 
through the vector processing elements, forms of 
data other than linear N-vectors must be handled 
more or less automatically. These functions will 
be discussed next. 


3.2.1 Automatic Padding of Short Superwords 


As mentioned earlier, not all superwords in 
a vector operation are a full 16 words. Inter- 
nally in the BSP System, the Arithmetic Elements 
(AE) recognize a "NULL" operand. The array memory 
also recognizes the NULL operand and inhibits a 
store when a NULL operand is encountered. The 
control unit automatically causes the alignment 
networks to pad short superwords by selecting NULL 
operands during input and output alignment events. 


3.2.2 Vector Element Conflict 


In the memory storage scheme, if d mod M = 0, 
all the elements of the linear vector lie in the 
same memory module. This is referred to as a 
vector element conflict condition. In this case, 
the access to the memory has to be sequential. 
In the BSP System, this condition is handled by 
forcing the superword size equal to 1. Thus the BSP 
System automatically adapts to this case without 
any software or other interruption. 
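The conflict condition is easy to see numerically. The sketch below assumes, as a reading of the storage scheme rather than a statement taken from this section, that the element at linear address x is held in module x mod M; the function name is hypothetical.

```python
M = 17   # memory modules (prime)
P = 16   # elements in a full superword

def modules_touched(B, d, n=P):
    # Module number of each of the n elements B, B + d, B + 2d, ...
    # (assumption: the module number is the linear address mod M).
    return [(B + i * d) % M for i in range(n)]

no_conflict = modules_touched(B=5, d=2)    # 16 different modules
conflict = modules_touched(B=5, d=17)      # d mod M = 0: one module only
```

Because M is prime, any displacement that is not a multiple of 17 spreads the 16 elements over 16 different modules; only d mod M = 0 collapses them onto one module.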


3.2.3 Inner and Outer Loop Optimization 


Consider the following FORTRAN program 
segment. 

      DO 10 I=1, 14 
      DO 10 J=1, 4 
   10 A(I,J) = B(I,J) 



This program can be performed in a single BSP 
vector operation consisting of 14 superwords each 
of length 4. However, it is faster to execute the 
above program segment with inner and outer loops 
interchanged, using 4 superwords of size 14. The 
BSP optimizes these cases by using hardware detec- 
tion of the fastest loop order from the parameters 
L and K of a VSD. Of course, not all loops can be 
interchanged, and a software check is made to 
allow the above optimization. 
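The benefit of interchanging the loops can be estimated by counting superwords. The helper below is a hypothetical illustration, not the BSP's detection hardware; it assumes that each vector of a given length needs ceil(length / N) superwords, with N = 16.

```python
import math

def superword_count(length, vectors, n=16):
    # Each vector of `length` elements is covered by ceil(length / n) superwords.
    return vectors * math.ceil(length / n)

as_written = superword_count(length=4, vectors=14)    # 14 short superwords
interchanged = superword_count(length=14, vectors=4)  # 4 nearly full superwords
```

With the loops as written, 14 superwords each carry only 4 useful elements; after interchange, 4 superwords carry 14 elements each, so the same 56 elements are moved in fewer memory cycles.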

Space prevents us from describing all the 
other functions performed by the alignment and 
memory system. These functions include, among 
others, handling scalar data in vector operations, 
data compression and expansion, and mode vector 
operations. 



4. Conclusion 


In this paper we have shown one design for a 
conflict-free array access memory. This design is 
based on the use of a prime number of memories. 
Crucial to this design is the simplification of 
the indexing equations, which allows most of the mod 
M operations and much of the other index calculation 
to be done with ROM's and other simple hardware. 
These simplifications were discussed in 
Section 3, along with a brief discussion of some 
of the necessary indexing hardware. Further 
details can be found in [LaVo/77]. 

The design of this memory system fits nicely 
in the context of the Burroughs Scientific 
Processor ([Stok/77], [KuSt79]). The vector 
machine instructions on this computer can encom- 
pass two levels of loop nesting, and the indexing 
hardware carries out the necessary addressing and 
alignment calculations automatically, once the 
initial vector set descriptors have been set up. 
One of the major problems with large vector com- 
puters has been that indexing overhead and memory 
access conflicts have a significant effect on 
overall vector performance. By using the prime 
memory system and indexing hardware described in 
this paper, the BSP is able to execute vector 
instructions efficiently. 
Abstract - The purpose of this paper is to 
present empirical results on the performance of 
parallel computations, with respect to various 
performance criteria, under different assumptions 
of the underlying computer architecture. The 
performance criteria used are the Parallel Index, 
the Speedup, the Utilization, the Efficiency, the 
Redundancy, the Compression and a definition of 
the Quality of the resultant computation. The 
underlying architectures assumed are parallel 
processor organizations of both the SIMD and MIMD 
varieties, with limited and unlimited degrees of 
physical parallelism. 


1 Introduction 


Computer architectures incorporating 
multiple processors which execute in parallel are 
being designed to reduce execution-time and to 
achieve better cost-performance, greater reliability 
and modularity. The trends appear to be towards 
special purpose, scientific supercomputers on the 
one hand, and towards general purpose multiple 
microprocessor systems with high performance to 
cost ratios, on the other hand. Although the 
decreasing cost and size of processors makes it 
feasible to consider using a large number of 
processors in a computer organization even at 
reduced efficiency of each component processor, 
it is important to estimate the effective speed 
actually attainable, over a representative set of 
computations. The sample considered in this 
paper may be described as existing general 
technical computations drawn from military, 
commercial and academic environments. It is 
emphasized that we are not interested in the 
maximum or minimum performance for any individual 
computation, but in the average performance over 
all the computations. 


A computer organization with p parallel 
processors will rarely attain its maximum 
parallel execution bandwidth of p operations per 
time-unit, or a speedup in execution-time of p 
times that of the uniprocessor organization, due 
to both logical and physical constraints on the 
parallel execution of operations. Logical 
constraints on parallelism include intrinsic 
data-dependencies, control dependencies and 
operator precedences in the program, which force 
a sequential chain of execution amongst the 
dependent operations, and hence limit the number 
of operations which may be executed in parallel 
[9]. Physical constraints on parallelism include 
the maximum number of processors available in the 
architecture, the control restrictions on the 
different types of operations which may be 
executed simultaneously, and the delays due to 
the communication and competition amongst the 
interacting components in the computer 
organization. The empirical results in this 
paper account for the logical constraints and the 
first two physical constraints mentioned. For 
tractability, the experiments assume that the 
rest of the system, such as the effective 
memory-processor and processor-processor 
bandwidths, is balanced with respect to the 
execution-bandwidth of the parallel processors. 
This may even be considered an advantage since 
the results are then independent of the specific 
machine implementation. The empirical 
performance results in this paper should be 
interpreted as the best performance results 
expected with current techniques for parallelism 
exposure in Fortran programs [5-8, 2], assuming 
no delays due to the cooperation and competition 
amongst the components of the parallel processor 
organization. The results would be degraded 
if communication delays within the system are 
considered, but at the same time, the results 
would probably be improved by the explicit 
specification of parallel programs or by even 
better algorithms for the automatic conversion of 
serial programs to parallel computations. 


Some of the questions we ask are: What is 
the performance of a computer organization with a 
limited number of parallel processors in the 
architecture? What is the performance in the 
idealized case where the number of processors is 
essentially unlimited? Are there severe 
performance degradations when only one type of 
operation may be executed simultaneously in one 
time-unit by the active processors? 


2 Model and Definitions 


A parallel computation is a sequence of 
steps, where each step consists of i operations 
which may be executed simultaneously, by i 
parallel processors. A step with i simultaneous 
operations is said to have degree of parallelism 
i, 1 ≤ i ≤ P, where P is the maximum degree of 
parallelism in any step of the computation. The 
logical parallelism or minimax degree of 
parallelism, P', is the smallest maximum number 
of processors required by the computation in 
order to achieve its minimum execution time, 
Tmin. 



A parallel processor organization is a 
computer organization with multiple processors, 
each of which is capable of executing one 
operation in one time-unit. Each processor is 
also capable of executing the whole repertoire of 
operations. An SIMD (Single Instruction Multiple 
Data) organization is a parallel processor 
organization where only one type of operation may 
be executed by the active processors in any one 
time-unit. An MIMD (Multiple Instruction 
Multiple Data) organization is a parallel 
processor organization where more than one type 
of operation may be executed by different 
processors in the same time-unit [3]. 


A parallel processor organization with p 
processors available in the architecture is said 
to have limited physical parallelism of degree p, 
and denoted a p-limited architecture. A parallel 
processor organization which always has as many 
processors, P', as required by the computation in 
order to achieve its minimum execution-time is 
said to have unlimited physical parallelism, and 
denoted an unlimited architecture. A p-limited 
computation is a parallel computation executing 
on a p-limited architecture, and an unlimited 
computation is a computation executing on an 
unlimited architecture. 


TOP-form. The TOP-form is a canonic form of parallel computations defined as the following 3-tuple: 

(T(P), O(P), P) 

where T(P) is the execution-time of the computation in steps, O(P) is the computation-size in number of operations executed, and P is the maximum degree of parallelism in the computation. P = P', the logical parallelism, for unlimited computations, and P = min(P',p) for p-limited computations. 


The TOP-form captures the fundamental difference in the dimensions of the parallel computation when compared to a serial computation. In a serial computation, since each operation takes one time-unit for execution, the execution-time and computation-size have the same value, T(1) = O(1). But in a parallel computation where P > 1, the execution-time and computation-size necessarily have different values, T(P) < O(P), forming two distinct dimensions of a parallel computation. The maximum degree of parallelism forms a third variable dimension in a parallel computation. 


Equivalence, Optimality and Acceptability. Computations are said to be equivalent if, given the same inputs, they always produce the same outputs. The internal algorithms and intermediate results in equivalent computations need not be the same. 


An optimal serial computation is defined as a serial computation with the minimum computation-size, Omin. An optimal parallel computation is defined as a minimum-time minimax-parallel computation, which is a computation that achieves the minimum execution time, Tmin, using the minimax degree of parallelism, P'. Further discussion of these definitions of optimality is available in [9]. 


A serial computation is said to be acceptable for comparison with a parallel computation if O(1) ≤ O(P). Otherwise, the O(P) operations of the parallel computation may be executed one at a time to obtain a shorter serial execution time and computation size. A parallel computation is said to be acceptable for comparison with a serial computation if T(P) ≤ T(1). Otherwise, the O(1) = T(1) operations of the serial computation may be executed using one processor to get a shorter parallel execution time. Hence, we propose the following: 


Principle of Acceptable Parallel-Serial Comparisons. A performance comparison of a parallel computation with an equivalent serial computation is said to be acceptable iff 

T(P) ≤ T(1) and O(1) ≤ O(P), or equivalently 

T(P) ≤ O(1) ≤ O(P), if each operation takes one time-unit for execution. 


This principle of acceptable 
parallel-serial comparisons is necessary to 
ensure that any measured performance improvements 
are due solely to parallel versus serial 
processing, rather than due to other factors. 

For example, the Speedup in execution time may be 
greater than p, the number of processors 
available in the architecture, when a parallel 
computation is compared with an unacceptable 
(nonoptimal) equivalent serial computation. Part 
of the performance improvement in this case is 
due to the optimization of a relatively 
inefficient serial computation. Similarly, the 
Speedup in execution time may be less than one, 
if the parallel computation entering into the 
comparison is unacceptable. 


Performance Measures. The TOP-form of a 
parallel computation and its equivalent serial 
size form the smallest set of parameters for the 
evaluation of all the performance criteria 
considered in this paper. Basically, the 
performance criteria fall into four categories: 
the speed of execution given by the Parallel 
Index and Speedup measures, the utilization of 
the processor-time resource given by the 
Utilization and Efficiency measures, the 
Compression (or conversely, the Redundancy) in 
the size of the computation, and the resultant 
Quality of processing. 


The Parallel Index and Speedup measure the 
average and effective speed, respectively, of the 
parallel computation in operations executed per 
time-unit: 

$$PI(P) = O(P)/T(P),$$

$$S(P,1) = O(1)/T(P) = T(1)/T(P).$$


The Speedup is defined with respect to the 
computation-size of an equivalent serial 
computation, and takes into account the extra 
operations introduced into a parallel computation 
to reduce its execution time. The Speedup may 
also be regarded as the ratio of the 
execution-time of the serial computation, to that 
of the parallel computation, making it equivalent 
to the definition found in [5]. 


The Parallel Index and Speedup may also be 
regarded as measures of the average and effective 
parallel execution bandwidths of the underlying 
parallel processor organization, during the 
execution of the given computation. 


The Utilization and Efficiency measure the cost-effectiveness of the computation in the sense that they weigh the speed improvement with the number of processors required. They measure the performance of the parallel computation with respect to its use of the processor-time resource: 

$$U(P) = O(P)/[P \cdot T(P)], \qquad E(P,1) = O(1)/[P \cdot T(P)].$$

In figure 1, the Utilization is the proportion of the rectangle PxT(P) covered by busy processor-steps, i.e., those time-units where a processor is busy executing an operation. The Efficiency may be regarded as the ratio of the serial processor-time requirement over the parallel processor-time requirement, since T(1) = O(1). 


The Redundancy measure is the ratio of the parallel computation-size, O(P), to the serial computation-size, O(1), of an equivalent serial computation. The Compression measure is the inverse ratio: 

$$R(P,1) = O(P)/O(1), \quad \text{and} \quad C(P,1) = O(1)/O(P).$$

One significance of the Redundancy measure is that it relates the relative speed and efficiency measures, S and E, to the absolute speed and efficiency measures, PI and U: 

$$S = C \cdot PI = PI/R \quad \text{and} \quad E = C \cdot U = U/R$$

The Speedup, Efficiency and Compression measures compare serial to parallel execution-time requirements, processor-time requirements and computation-size requirements, respectively (see table 1). In an optimal serial computation, the execution-time, processor-time and computation-size requirements are each equal to Omin. The Quality measure is defined as an overall performance measure comparing serial to parallel computations with respect to these three requirements: 

$$Q(P,1) = S \cdot E \cdot C = S \cdot E / R = \frac{O_{min}^{3}}{T(P)^{2} \cdot O(P) \cdot P}$$

Hence, the Quality measure is a more stringent measure of the performance improvement of parallel versus serial processing than the Speedup measure. One use of the Quality measure is to decide whether parallel processing is preferable to serial processing for a given program. For example, a computer installation may decide that parallel processing is desirable for a given program if the quality of processing increases by at least fifty percent (Q ≥ 1.5). 


In all acceptable parallel-serial comparisons, the following relationships hold: 

$$\begin{aligned}
1 &\le S(P,1) \le PI(P) \le P \\
1/P &\le E(P,1) \le U(P) \le 1 \\
1 &\le R(P,1) \le 1/E(P,1) \le P \\
1/P &\le E(P,1) \le C(P,1) \le 1 \\
Q(P,1) &\le S(P,1) \le PI(P) \le P
\end{aligned}$$

PI is an upper bound for S, and U is an upper bound for E, with equality iff O(P) = O(1) so that R = C = 1. The standard of comparison, an optimal serial computation, has PI, S, U, E, R, C and Q all equal to unity. 


The last relationship above shows four successively refined measures of the performance improvement of parallel versus serial processing. First, P = min(p,P') indicates the maximum processor bandwidth, or the maximum speed of the parallel computation. Then, the Parallel Index indicates the average processor bandwidth, or average speed, of the computation. Third, the Speedup indicates the effective processor bandwidth, or effective speed, of the computation. Finally, the Quality is a single performance measure that takes into account mainly the speed improvement, but also the efficiency and the redundancy of parallel versus serial processing. 


TABLE 1: OPTIMIZATION OF PERFORMANCE MEASURES 
AND TOP-FORM PARAMETERS 

Performance Measure              TOP-form parameter 

(1) maximizing Speedup      =    minimizing T(P) 
(2) maximizing Efficiency   =    minimizing PxT(P) 
(3) maximizing Compression  =    minimizing O(P) 
                                 (minimizing Redundancy) 
(4) maximizing Quality      =    all of the above: 
                                 minimizing T².O.P 
                                 (minimizing each component of TOP-form, 
                                 with emphasis on time) 
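As a worked illustration of the definitions in this section, the following Python sketch derives the TOP-form from a list of per-step degrees of parallelism and evaluates every measure with exact fractions. The function names and the sample computation are hypothetical; only the formulas are taken from the text.

```python
from fractions import Fraction

def top_form(step_degrees):
    """Return (T, O, P): number of steps, operations executed, maximum degree."""
    return len(step_degrees), sum(step_degrees), max(step_degrees)

def measures(T, O, P, O1):
    """Performance measures of a parallel computation against a serial size O1."""
    PI = Fraction(O, T)        # Parallel Index
    S = Fraction(O1, T)        # Speedup
    U = Fraction(O, P * T)     # Utilization
    E = Fraction(O1, P * T)    # Efficiency
    R = Fraction(O, O1)        # Redundancy
    C = Fraction(O1, O)        # Compression
    Q = S * E * C              # Quality
    return {"PI": PI, "S": S, "U": U, "E": E, "R": R, "C": C, "Q": Q}

# Hypothetical computation: five steps with 4, 4, 2, 1 and 1 simultaneous
# operations, compared with an equivalent serial computation of 10 operations.
T, O, P = top_form([4, 4, 2, 1, 1])
m = measures(T, O, P, O1=10)
```

Here the TOP-form is (5, 12, 4) and the comparison is acceptable, since T(P) = 5 ≤ O(1) = 10 ≤ O(P) = 12. The Speedup is 2 while the Quality is only 5/6, which illustrates how Quality penalizes the extra processors and the two redundant operations.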


Measures of Central Tendency. To characterize the performance of a set of computations rather than an individual computation on a given parallel organization, measures of the central tendency of the data are desired. In table 2, the sample means of the performance measures and TOP-form parameters are given, and in table 3, the median values are given. In table 4, another measure of central tendency is introduced, called the aggregate performance measures [9]. The aggregate performance measures are performance measures defined for the aggregate computation, which is the computation consisting of every step of every computation in the set of computations. In other words, the aggregate computation is the end-to-end concatenation in time of all the computations in the set. In parallel-serial comparisons, the aggregate performance measure is a ratio of the sum of the requirements of all the serial computations in the sample, divided by the sum of the corresponding requirements of all the equivalent parallel computations. For example, the aggregate speed measures for a set of computations are defined as: 

$$S^{a} = \frac{\sum O(1)}{\sum T(P)} = \frac{\sum T(1)}{\sum T(P)}$$

$$PI^{a} = \frac{\sum O(P)}{\sum T(P)} = \frac{\bar{O}(P)}{\bar{T}(P)}$$

where the sums run over all the computations in the set and the bars denote sample means. PI^a and S^a may be regarded as the average and effective parallel execution bandwidths, respectively, for a set of computations, in a single program environment, i.e., when the execution of the next computation in the set does not start till the execution of the current computation has ended. Since not all the processors are utilized by a computation during all steps of its execution, the introduction of multiprogramming could reduce the overall execution-time of all the computations in the set, though it cannot reduce further the execution time of any individual computation. Hence, PI^a and S^a may be interpreted as lower bounds for the average and effective parallel execution bandwidths in a multiprogramming environment. 


Whereas the mean performance measures give 
equal weight to each computation in the set, the 
aggregate performance measures tend to weigh each 
computation by the relative magnitudes of its 
computation-size and execution-time. It seems 
reasonable that the overall performance should be 
more affected by a longer computation than by a 
shorter one. In general, the aggregate 
performance measures indicate the performance of 
all computations considered as a whole, whereas 
the mean performance measures indicate the 
expected performance for any one computation in 
the set of computations. 
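The difference between the mean and the aggregate Speedup can be seen on a small, made-up sample. In this Python sketch the function names and the numbers are hypothetical; each computation is represented by its serial and parallel execution-times (T(1), T(P)).

```python
def mean_speedup(sample):
    # Average of the individual Speedups: every computation weighs the same.
    return sum(t1 / tp for t1, tp in sample) / len(sample)

def aggregate_speedup(sample):
    # Speedup of the end-to-end concatenation: sum of serial times
    # divided by the sum of parallel times.
    return sum(t1 for t1, _ in sample) / sum(tp for _, tp in sample)

# A short computation with Speedup 2 and a long one with Speedup 10.
sample = [(10, 5), (1000, 100)]
mean_s = mean_speedup(sample)
aggregate_s = aggregate_speedup(sample)
```

The mean Speedup is 6, halfway between the two individual values, while the aggregate Speedup is 1010/105, close to the Speedup of the long computation that dominates the total running time.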


3 The Experiments 


The raw data is obtained from runs of the Illinois Analyser version 2 [7], which transforms ordinary serial programs into equivalent parallel computations. The Analyser incorporates sophisticated algorithms for recognising serial program constructs and converting these to parallel program constructs for fast and efficient parallel execution. Version 2 (1978) of the Analyser differs from version 1 (1973) [5] mainly in the improved handling of linear recurrences [2] found in the serial program. 


Existing Fortran programs (ANSI standard) were obtained from various locations, like the Air Force Weapons Laboratory, Burroughs Corporation, the collected algorithms published by the ACM, a well-known scientific library of programs called EISPACK, some of the old programs from the Illinois Analyser Version 1 (prior to 1973), and other miscellaneous sources. These programs are run through the Illinois Analyser version 2, which produces as output the dependency graph of each program. This is then entered as input to simulators, which restructure the computation when necessary, to execute on either an unlimited MIMD organization, an unlimited SIMD organization, or a p-limited SIMD organization where p = 2^i, for i = 1, 2, ..., 14. Hence, 16 different parallel processor organizations are compared with the uniprocessor organization. 


Various data on the nature of the serial 
program and its parallel equivalent are 
collected, from which we abstract sixteen sets of 
raw TOP-forms and the raw serial computation 
size, O(1), for each computation, to use as 
inputs to our analysis programs. First, a 
standardization procedure is performed, to ensure 
that only acceptable parallel-serial comparisons 
of performance are produced. Essentially, the 
standardization consists of estimating the 
optimal serial computation size and the optimal 
parallel TOP-form for each 
computation-architecture combination [9]. The 
performance measures are then calculated from 
these standardized TOP-forms. The rows labelled 
"SIMDB" and "MIMDB" in tables 2,3 and 4 refer to 
the unlimited SIMD and MIMD cases, respectively. 
The row labelled "M/S ratio" gives the statistics 
for the ratio of values in the unlimited MIMD 
over the unlimited SIMD cases, to compare the 
effect of the added control restriction of SIMD 
parallel architectures. All the statistics in 
the tables are calculated for the entire sample 
of 355 computations. 


4 The Results 


TOP-form parameters. When the same sample of 355 computations is executed with varying degrees of limited physical parallelism, both the mean and median execution-times decrease, and both the mean and median computation-sizes increase, as the number of processors increases. In each case, the sample mean is about two orders of magnitude larger than the sample median, indicating a distribution that is skewed to the right. The mean and median execution-times in an unlimited MIMD environment are about 60% of the corresponding execution-times in an unlimited SIMD environment. 


The median values of the maximum number of processors required, P', indicate that if up to 64 parallel processors are available in the architecture, more than half the computations executed will utilize all the processors available during execution. For this sample of computations executing under an SIMD environment, half the computations require not more than 100 parallel processors. Also, half the computations use twice as many processors when an MIMD organization is assumed as when an SIMD organization is assumed. 


Parallel Index and Speedup. In [9], probabilistic hypotheses on the parallelism distribution in computations were proposed, yielding simple characterizations of the average speed, PI. Necessary and sufficient conditions were given for PI to have a value on the order of P/ln(P), and for PI to have an upper bound of P/ln(P). The natural logarithm function of P, ln(P), is used as an approximation to the Pth Harmonic number, H(P). The predicted values of PI, which are upper bounds for S in all acceptable parallel-serial comparisons, agree well with empirical observations. 
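The two characterizations, P/H(P) and its approximation P/ln(P), can be compared directly. The following Python sketch uses hypothetical helper names and assumes only the standard definition H(P) = 1 + 1/2 + ... + 1/P.

```python
import math

def harmonic(p):
    # P-th Harmonic number H(P) = 1 + 1/2 + ... + 1/P.
    return sum(1.0 / i for i in range(1, p + 1))

def p_over_h(p):
    return p / harmonic(p)

def p_over_ln(p):
    # Undefined at p = 1, where ln(1) = 0, so only use it for p > 1.
    return p / math.log(p)

# The processor counts used for the p-limited experiments: p = 2^i, i = 1..14.
comparison = {2 ** i: (p_over_h(2 ** i), p_over_ln(2 ** i)) for i in range(1, 15)}
```

Since H(P) exceeds ln(P) for every P, the P/H(P) value is always the smaller of the two, and the relative gap narrows as P grows.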


In figure 2, the average and aggregate 
values of PI are well approximated by the p/ln(p) 
curve, for p<300. For larger p, the average and 
aggregate values of PI are less than p/ln(p). 
Similarly, in figure 3, the average and aggregate 
values of S are well approximated by p/ln(p), for 
p<100, and less than p/ln(p) for larger p. The 
ln(p) curve forms a lower bound in each case. 


The mean, aggregate and median PI values 
tend to run approximately parallel to the 
corresponding Speedup values. Hence, the trend 
of these measures of central tendency (mean, 
median, aggregate) of the Speedup values are well 
predicted by the corresponding trend of the PI 
values, and vice versa. 


Figure 4 plots the Parallel Index and 
Speedup values, averaged over every ten 
consecutive points, assuming an MIMD organization 
with unlimited physical parallelism. The solid 
lines are the smoothed curves automatically 
generated for the crosses (PI) and diamonds (S) 
by the plotting package [1], using a smoothing 
algorithm involving running means, running 
medians, quadratic interpolation and "hanning" 
[11]. There is excellent agreement between the 
observed average PI values and the P'/ln(P') 
curve, for this range of P'. The Speedup values 
tend to lie below the P'/ln(P') curve. 


Similar plots for the unlimited SIMD 
computations indicate that there are no 
significant differences in the trends of the 
observed PI and S values when compared with the 
unlimited MIMD case. 


Binomial tests [4, 9] with a significance level of 5% were performed which indicate that the majority (more than 50%) of computations encountered have Parallel Indices and Speedups less than P'/ln(P'), in an SIMD or MIMD environment with unlimited physical parallelism. In fact, for unlimited MIMD computations, 

Prob{S(P',1) < P'/ln(P')} > 0.75 

In an SIMD environment with p-limited physical parallelism, the majority of computations have Parallel Indices and Speedups less than p/H(p) for p sufficiently large (p>64 processors for PI, p>16 processors for S). More than 80% of the computations have Speedups less than p/H(p), for p>256 processors. 


Utilization and Efficiency. For any individual computation: 

$$PI(P) = U(P) \cdot P, \quad \text{and} \quad S(P,1) = E(P,1) \cdot P,$$

where P may be interpreted as the physical parallelism, p, in a p-limited parallel architecture, or as the logical parallelism, P', in an unlimited parallel architecture. 


This relationship between the Parallel Index and Utilization measures, and between the Speedup and Efficiency measures also holds for the corresponding pairs of mean, median and aggregate values for a set of p-limited computations. For example, 

PI(p) = U(p,1).p, and S(p,1) = E(p,1).p. 

It is sometimes hypothesized that the 
Speedup is a linear function of p, of the form, 
k.p, for some constant k<1. However, figures 5 
and 6 clearly show that the mean, aggregate and 
median values of U and E are not constants 
independent of p. Hence, it is impossible for 
the corresponding values of PI and S to be linear 
functions of p. In fact, the mean, median and 
aggregate values of U and E are decreasing convex 
functions of p, implying that the corresponding 
values of PI and S are increasing concave 
functions of p. Note that the p/ln(p) 
characterization of PI and S is an increasing 
concave function of p. 
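
The shape argument above can be checked numerically. The following Python sketch is only an illustration: the function name `speed_model` and the range of p are assumptions made here and are not part of the original study. It confirms that p/ln(p) is increasing and concave for p >= 8, and that the implied efficiency 1/ln(p) is not a constant k.

```python
import math

def speed_model(p):
    """The p/ln(p) characterization of PI and S (defined for p > 1)."""
    return p / math.log(p)

# First differences (increments) of p/ln(p) for p = 8, ..., 1024.
increments = [speed_model(p + 1) - speed_model(p) for p in range(8, 1025)]

# Increasing: every increment is positive.
is_increasing = all(d > 0 for d in increments)

# Concave: the increments themselves shrink as p grows.
is_concave = all(b < a for a, b in zip(increments, increments[1:]))

# The implied efficiency speed_model(p)/p = 1/ln(p) falls as p grows,
# so it cannot equal a constant k as the linear hypothesis requires.
efficiency = [speed_model(p) / p for p in (16, 64, 256, 1024)]

print(is_increasing, is_concave)
print([round(e, 3) for e in efficiency])
```

A linear model k.p would give equal increments at every p, so the shrinking increments are what distinguish the concave characterization from the linear one.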


In the case of unlimited physical
parallelism, the smoothed curves of U(P') and
E(P',1) are well approximated by the 1/ln(P')
curve (figure 7). However, if the first data
point is ignored, it is not clear that U(P') and
E(P',1) are necessarily decreasing convex
functions of P'. In fact, they could be
described as very slightly decreasing linear
functions of P', which may even be considered
constant functions, independent of P'. The first
data point plotted has P', U(P') and E(P',1) all
identically equal to 1. The ten computations
represented by this data point are those where
the optimal parallel computation is in fact a
serial computation, since the minimax degree of
parallelism, P', is equal to 1. Except for this
first data point, all the other averaged U and E
values lie between 0.1 and 0.4. These empirical
observations suggest the following


Hypothesis on the Conservation of
Processor-Time. The ratio of the processor-time
requirement of an optimal serial computation to
that of an equivalent optimal parallel
computation is:


E(P',1) = O_min / (P' x T_min) = k, where k<0.5.


This hypothesis on the conservation of
processor-time does NOT imply that when more
processors are available to execute a given
computation, then the execution-time will
decrease accordingly, so that the processor-time
requirement stays constant. Rather, it implies
that the optimal parallel processor-time
requirement is more than twice the optimal serial
processor-time requirement, and the ratio of the
two quantities appears to be fairly constant over
many different computations.


There are no major differences in the 
utilization of the processor-time resource, when 
the computations are structured for an SIMD 
organization rather than an MIMD organization, 
with unlimited physical parallelism. 


Redundancy. In the empirical results, the 
median redundancy is less than 1.15 for all MIMD 
and SIMD computations, with unlimited and limited 
degrees of physical parallelism. Hence, half the 
computations achieve a parallel execution-time 
less than the serial execution-time, with the
introduction of less than 15% of redundant 
operations compared with the serial computation 
size. Although the mean redundancy is less 
robust than the median redundancy to extreme 
values, it is less than e-OSTOBNE)? (figures 8 and 
9). 


A variable X is said to be positively
associated, or in agreement, with another variable
Y if large values of X tend to occur with large 
values of Y, and small values of X tend to occur 
with small values of Y. Similarly, X is said to 
be negatively associated, or in disagreement, 
with Y if large values of X occur with small 
values of Y, and small values of X occur with 
large values of Y. The Spearman rank correlation 
coefficient, R, may be used to test the degree 
and direction of association between any pair of 
variables. The magnitude of R, 0 <= |R| <= 1, gives
the degree of association, and the sign of R
gives the type (agreement or disagreement) of
association.
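
As a reminder of how the coefficient is obtained, the sketch below computes the Spearman rank correlation from the rank differences, R = 1 - 6.SUM(d^2)/(n(n^2-1)). It is a minimal illustration that assumes no tied values; the function names and the sample data are hypothetical and are not taken from the study.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n, assuming there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_r(x, y):
    """Spearman rank correlation of two equal-length samples without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical samples: perfect agreement, perfect disagreement, and a
# mostly agreeing pair with two adjacent swaps.
print(spearman_r([1, 2, 3, 4], [10, 20, 30, 40]))    # 1.0
print(spearman_r([1, 2, 3, 4], [40, 30, 20, 10]))    # -1.0
print(spearman_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

A value near +1 indicates agreement, a value near -1 indicates disagreement, and a value near 0 indicates no association.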


From the tests of association based on the 
Spearman correlation coefficient in both 
unlimited SIMD and MIMD cases, the redundancy
measure is found to be negatively associated with 
the Efficiency and Quality, positively associated 
with P' and the Parallel Index, and not 
associated with the Speedup. It has also been 
observed [10] that larger redundancies are 
associated with larger probabilities of numerical 
instability in the parallel computation, as 
compared with the serial equivalent. Hence, 
parallel computations with large redundancies 
should be avoided, since these tend to be 
associated with inefficient computations with low 
qualities and higher probabilities of numerical 
instability. Also, since the Redundancy measure 
is found to be independent of the Speedup 
measure, redundant operations should be
introduced into a parallel computation only if 
this increases the Speedup (effective speed), and 
not just the Parallel Index (average speed), of 
the resultant parallel computation. 


Quality. In the tests of association based 
on the Spearman correlation coefficient, the 
Quality measure is found to be independent of the 
logical parallelism, P', in both SIMD and MIMD 
computations, assuming unlimited physical 
parallelism. This is a desirable result for the 
chosen definition of the Quality measure, since
the quality of processing should not be biased 
towards computations with either smaller or 
larger degrees of logical parallelism. 


The empirical median quality decreases as 
the physical parallelism increases beyond p=4, 
and is less than one, in all cases. Hence, more 
than half the computations in each p-limited and
unlimited SIMD and MIMD case have higher 
qualities when executed as a serial computation 
than when executed as a parallel computation. 


Unlike the relationship between the mean 
quality and mean speedup, the definition of the 
aggregate quality does not constrain it to have 
an upper bound given by the aggregate speedup 
measure. In the sample of computations examined, 
the aggregate Quality increased in value, at 
approximately the same rate as the aggregate 
Speedup, up to p around 100 (figure 11). As p 
increased beyond this "saturation point", the
aggregate Speedup began to level off and the
aggregate Quality declined. This behaviour is
representative of most individual and aggregate
computations, and hence the Quality measure may
be used to choose the optimal number of processors
to use in executing a given computation, or set 
of computations. 


Figure 12 shows the frequency distribution 
of the values of p at which the highest quality 
is attained for each computation in the sample. 
Almost half (46%) of the computations examined 
attain their highest quality value of one, at p=1 
(serial computations). The next largest 
frequency occurs at p=16 and p=32, where about 
eight percent, each, of the computations attain 
their highest quality values. The cumulative 
relative frequency curve indicates that about 
ninety percent of the computations attain their 
highest quality values for p<256. If this sample 
is representative of computations in general, 
then the parallel processor organization need not 
have more than 256 processors, in order that 
ninety percent of the computations executing on 
it may attain their highest quality potential. 

It is an interesting coincidence that the ILLIAC 
IV, an SIMD machine, was originally designed to 
have a maximum of 256 parallel processors. 


5 Conclusions 


The characterization of the performance
measures varies according to the maximum degree
of parallelism, P. Hence, we define the
following approximate ranges:
Let P=O(1) denote 1<=P<5, and P=O(10^i)
denote 5.10^(i-1) <= P < 5.10^i, for i=1,2,...

For p-limited architectures, the best
characterizations for the mean or aggregate values
of the speed measures are (figures 13a,b,c):


Approx. range of p          Speed: PI, S

p = O(1)                    O(p)
p = O(10) or O(100)         O(p/ln(p))
p = O(1000)                 O(ln(p)) < PI, S < O(p/ln(p))
p = O(10000) or more        O(ln(p))


In the case of unlimited physical 
parallelism, PI and S are best characterized as 
O(P'/ln(P')), for all P'=O(1000) or less.


These empirical observations support the
speed characterization of parallel computations
given in [9]:

"For general technical computations, the 
measures of central tendency such as the mean, 
median and aggregate values of the Parallel Index 
and the Speedup, all lie between k1.ln(P) and
k2.P/ln(P), where 0<k1<1 and k2>1.

Furthermore, the majority of computations will 
also have individual PI and S values between 
these lower and upper bounds. P may be 
interpreted as either the logical parallelism P', 
in an environment with unlimited physical 
parallelism, or as the physical parallelism, p, 
for sufficiently large p, in an environment with 
limited physical parallelism. So, 


max(1, k1.ln(P)) < S(P,1) < PI(P)
< min(k2.P/ln(P), P)"


Suppose that k1=0.5 and k2=2. Then,
for P>8, ln(P)/2 is greater than 1, and
2.P/ln(P) is less than P. Hence, except for the
smallest P values, the O(ln(P)) and O(P/ln(P))
bounds form increasingly tighter bounds for S and
PI, as P increases, when compared with the
absolute limits of 1 and P.
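
The effect of these constants can be tabulated directly. The sketch below is an illustration only; the names `speed_bounds` and `relative_width` are assumptions introduced here. It evaluates the lower bound max(1, k1.ln(P)) and the upper bound min(k2.P/ln(P), P) for k1=0.5 and k2=2, and compares the width of the resulting interval with the width P-1 of the absolute limits.

```python
import math

K1, K2 = 0.5, 2.0  # the constants supposed in the text

def speed_bounds(P):
    """Return (lower, upper) bounds on S and PI for parallelism P >= 2."""
    lower = max(1.0, K1 * math.log(P))
    upper = min(K2 * P / math.log(P), float(P))
    return lower, upper

def relative_width(P):
    """Width of the bounded interval as a fraction of the absolute range [1, P]."""
    lower, upper = speed_bounds(P)
    return (upper - lower) / (P - 1)

# Smallest integer P at which both logarithmic bounds become binding.
first_binding = next(
    P for P in range(2, 100)
    if K1 * math.log(P) > 1 and K2 * P / math.log(P) < P
)
print(first_binding)

for P in (8, 64, 1024, 16384):
    lower, upper = speed_bounds(P)
    print(P, round(lower, 2), round(upper, 2), round(relative_width(P), 3))
```

Both logarithmic bounds are already binding at P = 8, since ln(8) is about 2.08, and the relative width of the interval shrinks steadily as P grows.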


The empirical speed characterization may be 
used as a rough guide to the minimum number of 
parallel processors needed to attain a certain 
average or effective parallel execution 
bandwidth. For example, if an average parallel
execution bandwidth of ten operations per
time-unit is desired, then at least 36 parallel
processors should be used, since p/ln(p) =
36/ln(36) = 10.05. This predicted speed of
O(p/ln(p)) operations per time-unit for p=O(10)
processors should be regarded as the expected
speed potential, since in practice, interactions
between processors, memories and other components
of the computer organization will cause further
performance degradations. Conversely, given a
parallel processor organization with limited
physical parallelism of degree p, the appropriate
speed characterization given above may be used to
estimate its expected speed potential.
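
The rule of thumb for the minimum number of processors can be written as a small search. This is a sketch under the p/ln(p) characterization only; the function name `min_processors` and the search limit are assumptions made for illustration.

```python
import math

def min_processors(target_bandwidth, p_max=10**6):
    """Smallest p >= 2 whose expected speed p/ln(p) reaches the target."""
    for p in range(2, p_max + 1):
        if p / math.log(p) >= target_bandwidth:
            return p
    raise ValueError("target not reached within p_max processors")

# Ten operations per time-unit, as in the example in the text.
p = min_processors(10)
print(p, round(p / math.log(p), 2))  # 36 10.05
```

With 35 processors the predicted speed is about 9.84 operations per time-unit, which is why 36 is the smallest value that meets the target.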


For p-limited architectures, the best
characterizations for the efficiency measures, U
and E, are obtained by dividing the corresponding
values of PI and S by p. For example, if p=O(10)
or O(100), U and E are characterized by
O(1/ln(p)). Hence, if an efficiency of about
25 percent or more is desired, then fewer than 60
parallel processors should be used, since 1/ln(59)
= 0.245, which rounds to 0.25, but 1/ln(60) =
0.244, which rounds to 0.24. For a
smaller efficiency, more parallel processors may
be used.
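
The same characterization gives the largest processor count for a desired efficiency. The sketch below is illustrative; `max_processors` and its `decimals` parameter are assumptions introduced here. It shows how the answer depends on whether 1/ln(p) is compared exactly or after rounding to two decimal places.

```python
import math

def max_processors(min_efficiency, decimals=None, p_max=10**6):
    """Largest p >= 2 with 1/ln(p) >= min_efficiency (optionally after rounding)."""
    best = None
    for p in range(2, p_max + 1):
        efficiency = 1 / math.log(p)
        if decimals is not None:
            efficiency = round(efficiency, decimals)
        if efficiency < min_efficiency:
            break  # 1/ln(p) only decreases as p grows
        best = p
    return best

print(max_processors(0.25))              # exact comparison: 54
print(max_processors(0.25, decimals=2))  # two-decimal comparison: 59
```

Compared exactly, 1/ln(p) stays at or above 0.25 only up to p = 54, because e^4 is about 54.6. The figure of fewer than 60 processors corresponds to the two-decimal comparison.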


For p-limited architectures, the empirical
mean and aggregate Utilization and Efficiency 
measures substantiate the observation that the 
corresponding mean and aggregate PI and S 
measures are increasing concave functions of p, 
like p/ln(p), and not increasing linear functions 
of p. 


For unlimited architectures, the averaged 
Utilization and Efficiency measures defined with 
respect to the logical parallelism, P', suggested 
a hypothesis on the conservation of 
processor-time.


Parallel computations with large Redundancy 
measures should be avoided since these are 
associated with inefficient computations with low 
qualities and higher probabilities of numerical 
instability. Also, redundant operations should 
be introduced into parallel computations only if 
this decreases the parallel execution-time when 
compared with known equivalent serial 
execution-times. 


The Quality measure may be used to choose 
the optimal number of processors to use in 
executing a given computation or set of 
computations. 


The mean, median and aggregate values of 
the M/S ratios for the Speedup and Quality 
measures imply that the performance improvement 
of MIMD versus SIMD organizations is less than 
two and a half times. This same performance 
improvement may also be obtained by increasing 
the number of processors available in the 
architecture. For example, to upgrade the 
performance of a given p-limited SIMD 
architecture, the incremental cost of conversion 
to the less restrictive MIMD organization should 
be compared with the incremental cost of adding 
more parallel processors to the SIMD 
architecture. 
PERFORMANCE EVALUATION OF PIPELINE ARCHITECTURES


Jamshed H. Mirza 
Division of Computer Science 
Polytechnic Institute of New York 
Brooklyn, New York 11201 


Summary 


This paper is a report on the investigation of a 
generalized method for evaluation of pipeline 
processors under more realistic conditions than
have been previously considered.


In Section I, our aim is to consider several al- 
ternatives for a performance measure for pipeline 
architectures, and come up with an index that 
clearly reflects how well the pipeline has been 
organized from a purely architectural point of 
view. The chosen index should be independent of 
all issues unrelated with the architectural 
sophistication of the design. Such a performance 
measure would be useful during the initial design 
stages for analysing a design to see if it is 
likely to meet the stated requirements. It can 
be used for studying and comparing several alter- 
natives for a design, and as an aid to making 
relevant architectural decisions based upon that 
study. It can be used for studying and evaluating
several structurally different pipeline
architectures and to determine which, if any, is 
inherently superior under a given job environment. 


In Section II, a Markov Chain model is proposed
for pipeline processors, and a method is suggested
for determining the performance factor. This
is followed by an illustrative example and some
results of a preliminary analysis of the Texas
Instruments Advanced Scientific Computer (TI-ASC).


Section I: Our aim in this section is to inspect 
several alternative performance indices for pipe- 
line architectures, and select one that best 
satisfies the following two conditions: (a) it 
should clearly reflect the sophistication of the 
design, viewed from a purely architectural point 
of view, and (b) it should be unaffected by as- 
pects that are not relevant to the basic archi- 
tecture. To satisfy these conditions, we need a 
performance factor that shows the increase in 
throughput rate attained by the pipelined archi- 
tecture, as compared to an unpipelined architec- 
ture supporting strictly sequential, non- 
overlapped execution. 


The throughput rate of a pipeline is directly 
affected by the number of segments in the pipe, 
the job characteristics, the pipe structure, and 
the pipe configurability. By jobs we mean units 
of computation that are separately initiated. In 
the usual instruction processing pipe, a job 
refers to a machine instruction. The jobs pro- 
cessed may be logically dependent or independent 
of each other. The pipe structure may be linear 
(when each segment receives control from only one 
segment and transfers control to only one other 
segment), or planar (when the previous restriction 
does not hold). The pipe may also be configurable 
(when the segment-interconnection structure is 
capable of taking different forms at different 
times), or non-configurable (when the interconnect- 
ion structure is always constant). 
The number of segments in the pipe will determine
the extent of possible overlapped execution, and 
therefore the possible increase in throughput 
rate. Job characteristics will decide whether 
logical dependencies are possible. Dependent 
jobs imply more complex control requirements and 
lower utilization and throughput rates. Pipe 
structure, on the other hand, will decide whether 
job collisions are possible. While planar pipes 
allow the sharing of common segments among two or 
more functional pipes, they also introduce the 
possibility of job collisions (two or more jobs 
attempting to use a particular segment at the 
same time). Detection and avoidance of collisions
also result in complex control requirements
and reduced throughput rate. Configurable
pipes also allow the sharing of segments while
reducing the collision problems. However, they
entail a reconfiguration overhead; a job requiring
a configuration different from the one
currently existing is held up until the pipe is
completely flushed. Thus the efficiency of a pipeline
architecture tends to increase with the
number of segments into which it is divided, 
while it is adversely affected by logical 
dependencies, collisions and reconfiguration 
overheads. 


Previous attempts at evaluating pipeline architectures
can be found in [1, 2, 3]. However, they
all consider linear, non-configurable pipes and
only [3] considers logical dependencies between
jobs.


There are several alternatives for a performance
measure for pipeline architectures that ought to
be considered. Manufacturers have used the segment
clocking rate to show the absolute raw
processing rate of a pipelined machine. However,
that need not reflect the architectural sophistication.
The number of segments in the pipe is an
appealing parameter to be used as an index of the
pipelining used. However, it fails to consider
delays due to dependencies, collisions and
reconfiguration. Moreover, one could increase the
number of segments by introducing unnecessary
non-compute segments which would result in no real
gain in throughput rate; it may in fact deteriorate
because dependent instructions may now have to
wait even longer for the dependencies to be
resolved. Utilization (average number of active
segments at any time) would reflect the extent of
performance deterioration because of various
delays. However, it would not show the extent of
segmenting employed. One could raise the
apparent utilization by reducing the number of
segments in the pipe although it would tend to
reduce the throughput rate. Average
instruction-initiation rate, used as a performance index,
gives a measure of the processing rate of the
system if the various delays are considered.
However, it also fails to take into consideration
the number of segments in the pipe.


None of the above alternatives shows itself to be
the unified quantity we are looking for: one that
takes into account both the number of segments in 
the pipe as well as the delays due to dependencies, 
collisions and reconfigurations. The performance 
factor finally chosen is in fact a combination of 
two of the alternatives considered. It is given 
by: 


PF = m_ave / d_ave


Here, m_ave is the number of segments in the different
functional pipes weighted by their probability
of being traversed. d_ave is the average
job-initiation rate; it takes into account the
relative frequency of occurrence of the different
instructions and the delays due to the various
reasons. Note that often non-compute segments are
inserted in the pipe in order to balance out the
flow of jobs through the different functional
pipes so as to avoid or reduce delays due to
logical dependencies and collisions. Since these
segments perform no computational step, they
should not be considered in the determination of
m_ave. However, since they help to reduce delays,
these non-compute segments should be considered in
the evaluation of d_ave.


The ratio of m_ave to d_ave effectively gives the
average number of segments in the pipe that are
active at any time (i.e., actually processing an
instruction). This provides a measure of the
speed-up realized over strictly sequential,
non-overlapped execution. Factors that affect the
absolute throughput rate, but have no bearing
upon the pipeline characteristics of the system
have no effect upon the performance factor.


Section II: In this section an analytic method, 
based on a Markov Chain model, is presented for 
analysis and performance evaluation of a pipeline
architecture. The proposed method is general,
and is applicable to multifunctional pipeline 
systems with N > 1 functional pipes. The pipes 
may be linear or planar, non-configurable or con- 
figurable, and the jobs processed may be mutually 
dependent or independent. 


Let (S, P) be a multifunctional pipeline system
containing N functional pipes.


S = {S_1, S_2, S_3, ..., S_m} is the set of m
physically distinct segments in the system. Some
of these segments may be functionally identical
if, for overall efficiency reasons, certain
relatively over-utilized segment-types are
replicated. Also, some of the segments may be
non-compute.


Each segment S_j is specified by a 3-tuple:
S_j = (F(S_j), U(S_j), C(S_j)), where F(S_j) is the
set of operations performed by S_j, and U(S_j) and
C(S_j) identify the set of source ("used") and
destination ("changed") elements referenced by
S_j.


P = {P_1, P_2, P_3, ..., P_N} is the set of N
different functional pipes. Each defines a different
path through a subset of the segments. Thus
each functional pipe is defined by an ordered
sequence


P_i = <S_i,1, S_i,2, ..., S_i,m_i>


where S_i,j ∈ S, and is the segment that a job
traversing functional pipe P_i would be in during
the jth active cycle after initiation.


Let α = <α_1, α_2, ..., α_N> be the expected
job-profile, where α_i is the
probability that a job entering the system will
traverse functional pipe P_i. We insist that


Σ(i=1..N) α_i = 1.


If this is not the case (allowing for the possibility
that at certain points in time no job
arrives at the pipe for initiation), we introduce
a fictitious "null" pipe P_N+1 which uses
no segments and for which α_N+1 = 1 - Σ(i=1..N) α_i.

The proposed method of analysis assumes that all
segments have identical and constant processing
time so that the segments are clocked synchronously.
This is a reasonable assumption that
simplifies the analysis without limiting its
applicability.


We also assume that a "Delay-Before-Initiation" 
(DBI) strategy is used for job initiation. 
According to such a strategy, all delays neces- 
sary for proper execution of a job are inserted 
before the job is actually initiated. When a job 
arrives at the pipe, it is delayed just sufficient- 
ly so that once it is initiated, it will not have 
to be held up at any segment within the pipe in 
order to resolve dependencies or avoid collisions 
with jobs that had entered the system earlier. 


The DBI strategy is unlike the "Delay-After-Initiation"
(DAI) strategy. In the case of the
DAI strategy, all necessary delays are not inserted
at just one point right at the beginning.
Instead, a job suffers short delays at several
different stages in its path as required to remove
the immediate threat of unresolved dependency
or collision. For planar pipes the DAI strategy
in general results in fewer overall delays than
the DBI strategy, but is much more difficult to
analyse and to implement. Our evaluation method
will therefore yield a worst-case estimate. In the
case of linear pipes, however, both strategies
result in the same amount of delay, and therefore
the assumption about the job-initiation strategy
is not significant.


In a practical environment, the job arrivals are 
random and have no particular regularity. Moreover,
logical dependencies will require
dependent sequences of instructions to be initiated
in the order they arrive. Consequently, a
first-come-first-served greedy scheduling strate- 
gy is the practical choice and is assumed here. 
We also assume that the processor is an SISD so 
that at most one instruction is initiated at each 
cycle. If we do allow for the possibility of 
more than one instruction being initiated at the
same time, an extra degree of complexity would be 
added to the evaluation process. 


Evaluation of m_ave. This is the weighted average
of the number of segments in the N functional pipes
and is given by

m_ave = Σ(i=1..N) α_i (m_i - m_i^nc)

where

m_i    = length of functional pipe P_i in number of
         segments
m_i^nc = number of non-compute segments in P_i
α_i    = probability that a job will traverse pipe P_i
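
The weighted average can be computed in a few lines. The sketch below is illustrative only: the function name `m_ave` and the two-pipe figures in the usage example are hypothetical and do not describe any machine discussed in this paper.

```python
def m_ave(alpha, lengths, non_compute):
    """Weighted average number of compute segments over the functional pipes.

    alpha[i]       -- probability that a job traverses pipe P_i
    lengths[i]     -- m_i, the length of pipe P_i in segments
    non_compute[i] -- number of non-compute segments in pipe P_i
    """
    if not (len(alpha) == len(lengths) == len(non_compute)):
        raise ValueError("one entry per functional pipe is required")
    if abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("the job-profile probabilities must sum to 1")
    return sum(a * (m - nc) for a, m, nc in zip(alpha, lengths, non_compute))

# Hypothetical system: pipe 1 has 8 segments, one of them non-compute;
# pipe 2 has 7 segments, all of which compute.
print(m_ave(alpha=(0.6, 0.4), lengths=(8, 7), non_compute=(1, 0)))
```

Because the non-compute segment is excluded, both hypothetical pipes contribute seven compute segments and the weighted average is 7 regardless of the job-profile.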


Evaluation of d_ave. This is the expected
job-initiation rate. In the ideal case d_ave = 1.
Under more realistic conditions, when delays due
to various reasons exist, d_ave > 1, thus reducing
the throughput rate. Therefore we need a
representation for the state of the pipeline system that
contains sufficient information to allow us to
estimate this delay.


Associated with each functional pipe P_i is an Nxp
binary matrix D_i called the Delay Matrix. Here N
is the number of functional pipes, and


p = MAX {m_i}.


D_i(j,k) = 1   implies that scheduling a job P_j k
               time units after a job P_i has
               been initiated is to be prevented, as
               it will violate logical dependency
               rules, cause a collision, or because
               a reconfiguration is required.

D_i(j,k) = 0   implies P_j may be initiated k time
               units after P_i has been initiated.


Note that by a job P_i is meant a job that traverses
pipe P_i. The Delay Matrix is an extension
of the collision vector proposed by Davidson et al.
[4,5]. Since we are assuming a DBI strategy,
information regarding logical dependencies and
reconfiguration overheads can also be determined
and incorporated into the Delay Matrices [6].


The state of the pipelined system with N functional
pipes is then defined by an Nxp binary matrix q
such that


q(j,k) = 1 if and only if scheduling a job P_j k
           time units from the present will cause
           P_j to collide with some job within the
           pipe, or will allow some logical
           dependency of P_j on an earlier job
           within the pipe to go unresolved, or
           if it requires a reconfiguration.


Using such an Nxp binary matrix as the state of the
system, we develop a Markov Chain model for the 
pipeline processor. The Markov Chain is derived 
by associating with the system an internal state 
every time a job is initiated. These states are 
called the initiation states. For each of the N 
job-types that could be waiting to be initiated, 
the current state provides information about the 
delay that would be incurred, and the next state 
to which the system transfers after the job is 
initiated. Note that at the very beginning,
before any job has been initiated, the state of
the system is q_0 = {0} (a matrix with all
elements zero).


Definition: q_0 = {0} is an initiation state.
If q is an initiation state, then so
also are all states


Shl(q,k_i) ∪ D_i,
where q(i,k_i) = 0
and q(i,j) = 1 for all j < k_i,
for all 1 <= i <= N.


Shl(q,k) is the Nxp binary matrix obtained by
shifting each row of q, k positions to the left.
Logically OR-ing the shifted matrix with the delay
matrix D_i corresponding to the job P_i gives the
state entered when P_i is initiated.


The behavior of a pipeline system (S, P) can
now be completely described by the 6-tuple


(Q, P, D, σ, λ, q_0),


where
Q   : is the set of initiation states of the
      system

P   : {P_1, P_2, ..., P_N} is the set of N
      different job-types (functional pipes)

D   : {D_1, D_2, ..., D_N} is the set of N
      Delay Matrices associated with the N
      job-types

σ   : Q x P -> Q is the next state function;
      given the present state and the job-type,
      the function identifies the initiation
      state the system will enter when the job
      is initiated

λ   : Q x P -> {1, 2, ..., p} is the delay
      function; given the present state and the
      job-type, the function identifies the
      delay that is incurred before the job is
      allowed to be initiated.

q_0 : is the initial state of the system;
      q_0 = {0}


The functions σ and λ are defined by the following
relations: Let the present initiation state
be q, and the job waiting to be initiated be P_i.

If q(i,k) = 0
and q(i,j) = 1 for all j<k,
then σ(q,P_i) = Shl(q,k) ∪ D_i and λ(q,P_i) = k.


With the help of the next-state function σ, and
the expected job-profile α, we can find the
state transition matrix T = {t_i,j}. T can be
shown to be a stochastic matrix. Algorithms to
determine Q, σ, λ, and T are given in [6].
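
The next-state and delay functions translate almost directly into code. The sketch below is a minimal illustration and is not the algorithm of [6]: the function names, the use of tuples of 0/1 rows for the matrices, the zero-based job index i, and the single-pipe delay matrix in the usage example are all assumptions made here. It also assumes that a row with every entry set clears after p+1 time units.

```python
def shl(q, k):
    """Shl(q, k): shift every row of q left by k positions, filling with 0."""
    p = len(q[0])
    return tuple(row[k:] + (0,) * min(k, p) for row in q)

def delay(q, i):
    """Delay function: smallest k >= 1 with q(i,k) = 0 (columns count from 1)."""
    for k, blocked in enumerate(q[i], start=1):
        if not blocked:
            return k
    return len(q[i]) + 1  # assumption: a fully blocked row clears after p+1 units

def next_state(q, i, D):
    """Next-state function: Shl(q, k) OR D_i, where k is the delay for job i."""
    shifted = shl(q, delay(q, i))
    return tuple(tuple(a | b for a, b in zip(row, d_row))
                 for row, d_row in zip(shifted, D[i]))

def initiation_states(D):
    """Enumerate the set Q of initiation states reachable from q_0 = {0}."""
    n, p = len(D), len(D[0][0])
    q0 = tuple((0,) * p for _ in range(n))
    seen, frontier = {q0}, [q0]
    while frontier:
        q = frontier.pop()
        for i in range(n):
            nxt = next_state(q, i, D)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Hypothetical single-pipe system (N=1, p=3): a second job may not start
# 1 or 3 time units after the previous one.
D = [((1, 0, 1),)]
q0 = ((0, 0, 0),)
q1 = next_state(q0, 0, D)
print(delay(q0, 0), q1, delay(q1, 0), len(initiation_states(D)))
```

In this hypothetical case the first job starts after one time unit, every later job waits two time units, and the system has only two initiation states.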


The following theorems are proved regarding the
system:

(a) The set of states Q contains one and only
    one ergodic set Q̂ ⊆ Q.

(b) If any D_i = {0} then Q̂ = Q.

A fast algorithm is suggested in [6] for determining
the transient states and the ergodic states.


Let Q̂ = {q̂_1, q̂_2, q̂_3, ..., q̂_n} be the ergodic
set and T̂ be the corresponding nxn transition
matrix (obtained from T by deleting rows and columns
corresponding to the transient states). At
steady-state, the system will be in the ergodic set; let

π̂ = <π̂_1, π̂_2, π̂_3, ..., π̂_n> be the steady-state
state-probability vector, which can be
determined by solving the set of equations:


π̂ = π̂ x T̂   and   Σ(i=1..n) π̂_i = 1.


Let Λ̂ be the nxN delay function matrix corresponding
to the n ergodic states; it is obtained
from the delay function λ.


Then d_ave can be shown to be:

d_ave = π̂ x Λ̂ x (α)^transpose
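
The whole evaluation of d_ave can be sketched end to end. This is an illustration under stated assumptions, not the algorithm of [6]: instead of deleting the transient states and solving the linear equations, it iterates a damped power method over all reachable states, which drives the probability of the transient states to zero. All names, the zero-based job index, and the delay matrices in the usage example are hypothetical.

```python
def shl(q, k):
    """Shift every row of q left by k positions, filling with 0."""
    p = len(q[0])
    return tuple(row[k:] + (0,) * min(k, p) for row in q)

def delay(q, i):
    """Smallest k >= 1 with q(i,k) = 0; p+1 if the row is fully blocked (assumption)."""
    for k, blocked in enumerate(q[i], start=1):
        if not blocked:
            return k
    return len(q[i]) + 1

def next_state(q, i, D):
    """Shifted state OR-ed with the delay matrix of the initiated job."""
    shifted = shl(q, delay(q, i))
    return tuple(tuple(a | b for a, b in zip(row, d_row))
                 for row, d_row in zip(shifted, D[i]))

def d_ave(D, alpha, sweeps=2000):
    """Expected delay per initiation for delay matrices D and job-profile alpha."""
    n, p = len(D), len(D[0][0])
    q0 = tuple((0,) * p for _ in range(n))
    states, index, frontier = [q0], {q0: 0}, [q0]
    while frontier:                      # enumerate the initiation states
        q = frontier.pop()
        for i in range(n):
            nxt = next_state(q, i, D)
            if nxt not in index:
                index[nxt] = len(states)
                states.append(nxt)
                frontier.append(nxt)
    size = len(states)
    T = [[0.0] * size for _ in range(size)]   # state transition matrix
    for s, q in enumerate(states):
        for i in range(n):
            T[s][index[next_state(q, i, D)]] += alpha[i]
    pi = [1.0] + [0.0] * (size - 1)           # start in q_0
    for _ in range(sweeps):                   # damped step avoids oscillation
        step = [sum(pi[s] * T[s][t] for s in range(size)) for t in range(size)]
        pi = [0.5 * a + 0.5 * b for a, b in zip(pi, step)]
    return sum(pi[s] * alpha[i] * delay(q, i)
               for s, q in enumerate(states) for i in range(n))

# Hypothetical two-pipe system (N=2, p=1): a job of type 1 may not start
# one time unit after another job of type 1; nothing else is restricted.
D = [((1,), (0,)), ((0,), (0,))]
print(round(d_ave(D, alpha=(0.5, 0.5)), 4))  # 1.25
```

In this hypothetical case a delay of two time units arises only when a type-1 job follows a type-1 job, which happens with probability 0.25, so the expected delay is 1.25. Dividing a value of m_ave by this d_ave gives the performance factor PF.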
A simple extension of the evaluation method
allows pipelined vector processors to be handled
[6].
The following is a simple example to illustrate 
the performance evaluation method developed here. 
The hypothetical pipeline processor being con- 
sidered here is planar, non-configurable and multi- 
functional with two functional pipes. 


Processor: = (S , P ) 
S = {$5 S. S.> wae 4 Sy } 
P = {P_1, P_2}


P 


il 


yt 8h 5 53978550: > 


Po =< 8156575655585 > 
Based on this information and the complete specification
of the 3-tuples corresponding to each
segment S_j (not shown here), the Delay Matrices
are found to be:

a fecaccoo D. = eee 

—_ s050000) 2 \ 0100000} 

The state-diagram showing the behavior of the 
system is shown in the figure. 


[Figure: state diagram of the example system]


The ergodic set is Q̂ = {q̂_1, q̂_2, q̂_3, q̂_4, q̂_5}, and
the steady-state probability vector is


π̂ = <π̂_1, π̂_2, π̂_3, π̂_4, π̂_5>
  = <0.114, 0.332, 0.080, 0.186, 0.288>

d_ave = 1.7374,  m_ave = 7.0
PF = 4.029


Thus, for the expected job-profile α, the effective
processing power of the pipelined processor
is about 4 times that of a corresponding non- 
pipelined machine. 


Table I shows some of the results of an analysis 
of the Arithmetic-Logic Unit of the TI-ASC. This 
unit is configurable and requires the complete 
pipe to be flushed every time a new configuration
is to be set up. The expected performance in a 
scalar processing environment, in a vector pro- 
cessing environment of several different expected 
vector lengths, and the ideal throughput potential
are shown. These are compared with the expected
performance if the ALU were in fact not
made configurable and hence not burdened by the 
reconfiguration overheads. As can be seen, the 
improvement in a scalar-processing environment is 
quite significant, but is insignificant in a vec- 
tor environment with even moderately large vector 
lengths. A more detailed study and relevant dis- 
cussion will be found in [6]. 


TABLE I : ANALYSIS OF THE TI-ASC ALU
(10 Functional Pipes considered)


Processing           Configurable   Non-Configurable   Improvement*
Environment

Scalar               1.190476       2.869866           2.410688
Vector (L=10)**      2.986179       3.995209           1.337900
Vector (L=100)       4.362761       4.530451           1.037284
Vector (L=1000)      4.614627       4.633850           1.004166
Vector (ideal)***    4.645000       4.645000           1.000000


*   PF(non-configurable)/PF(configurable)
**  L is the expected vector length
*** L is considered infinitely large
A CRAY-1 SIMULATION USING PASCAL-PLUS


R.H. Perrott and C. King 
Department of Computer Science 
The Queen's University of Belfast 
BT7 1NN, N. Ireland


Summary 


This paper reports on an experiment in par- 
allel program construction, namely, the simulation 
of the computation section of the Cray-1 computer 
[1] using the language Pascal-Plus [2]. Pascal- 
Plus is an extended version of Pascal which pro- 
vides the user with parallel features. 


As a result of this project we now have a 
working model of the Cray-1. The model accepts 
programs written in CAL, the Cray-1 assembly lang- 
uage, and produces a summary of the usage of the 
functional units, the memory accesses, the amount 
of scalar and vector computation, the run time of 
the program, and the MFLOP and MIP rates. In addition 
the user can request an instruction by instruction 
trace of a program. 


Previous simulations of the Cray-1, such as 
that at the University of Michigan [4], have been 
constructed using a sequential programming lang- 
uage like Fortran. We found that the parallel 
features, program modularisation and data abstrac- 
tion facilities of Pascal-Plus were well suited 
for the simulation of the concurrent activities of 
the Cray-1. 


The language used for the simulation was 
Pascal-Plus, full details of which may be found 
in [5]. However the salient features which were 
used in the construction of our model are des- 
cribed below. 


Pascal-Plus is an extended version of Pascal 
which was specifically designed to support paral- 
lel processes and to enable discrete event simul- 
ation. The language extensions are the envelope 
structure which is an aid to program modularis- 
ation and data abstraction, the process, monitor 
and condition structures which provide a means of 
representing parallel processes and controlling 
their subsequent interaction, and a simulation 
monitor, which provides pseudo-time control fac- 
ilities for parallel programs. 


The envelope is used to define a data struc- 
ture and all the operations that can be performed 
on that data structure; the operations are repres- 
ented by means of procedures or functions. In 
addition, there is a control structure which 
brackets or envelopes the execution of any block 
which creates an instance of the data structure. 
In this way the user can ensure that certain 
actions can be performed before and after the 
execution of the block in which the instance of 
the data structure is declared. 


In our model the envelope was used to 
represent the collection and the output of 
statistics for each program executed by the simul- 
ator. In this way the design, development and 
construction of the statistic collection module 
could be isolated from the development of the rest 
of the model. 


The overall structure of this envelope was 
as follows:- 


envelope statistics ; 
   declaration of local data, procedures and 
   functions ; 
begin (* body of the envelope *) 
   initialisation of the data ; 
   *** ; (* inner statement *) 
   finalisation 
end ; (* statistics *) 


Instances of this envelope, as many as the pro- 
grammer requires, can be declared as follows:- 

   instance timing : statistics ; 

The block in which 'timing' is declared is then 
executed in place of the inner statement, which is 
represented by '***' in the body of the envelope. 


Only the procedures and functions of the 
envelope which are prefixed by an asterisk '*', 
i.e., starred, can be called by the statements 
comprising the body of the block in which 'timing' 
is declared; starred data identifiers can also be 
accessed but in read only mode. All other ident- 
ifiers and procedures are therefore protected. 


The envelope was found to be a useful 
abstraction mechanism in this simulation experi- 
ment. The block which declares and uses the 
facilities of the 'statistics' envelope requires 
no knowledge of its representation or its initial- 
isation or finalisation phases. Hence it could be 
constructed separately and even modified at a 
later stage provided none of the starred ident- 
ifiers were changed. 
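
For readers unfamiliar with Pascal-Plus, the bracketing behaviour of the envelope can be approximated in a present-day language. The Python sketch below is only an analogy under our own assumptions: a context manager stands in for the envelope, the 'with' block stands in for the block that declares the instance 'timing', and the names Statistics, record and summary are hypothetical and do not come from our model. Python's leading-underscore convention merely signals protection; it does not enforce the read-only access that starring provides.

```python
from contextlib import contextmanager


class Statistics:
    """Data hidden inside the envelope; the public methods play the
    role of the starred procedures."""

    def __init__(self):
        self._busy_time = {}  # protected data, not touched by the user block

    def record(self, unit, clock_periods):
        """Hypothetical starred procedure: accumulate busy time per unit."""
        self._busy_time[unit] = self._busy_time.get(unit, 0) + clock_periods

    def summary(self):
        """Hypothetical starred function: read-only copy of the totals."""
        return dict(self._busy_time)


@contextmanager
def statistics(report):
    timing = Statistics()             # initialisation of the data
    yield timing                      # inner statement: the user's block runs here
    report.append(timing.summary())   # finalisation (reached on normal completion)
```

A block using the facility then needs no knowledge of the representation, for example `with statistics(report) as timing: timing.record("add", 6)`.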


Processes are used to identify any independ- 
ent actions which may take place in parallel, for 
example, the execution of the functional units. 

A process can be defined and then instances of it 
declared, similar to the way in which envelopes 
are defined. The inner statement of the block 
in which an instance of the process is declared 
represents the execution of the body of the pro- 
cess. Once activated the processes proceed con- 
ceptually in parallel until they terminate where- 
upon the finalisation statement (if any) is 
executed. 


A monitor [6] consists of the data which 
several processes wish to share and the procedures 
which can manipulate this data; the data can only 
be accessed and updated by a single process at a 
time. Thus a monitor provides a means of control- 
ling communication and interaction among the pro- 
cesses by guaranteeing exclusive access to the 
data. 


If a process enters a monitor to update a 
shared variable it may have to be suspended pend- 
ing the action of another process; this is achieved 
by means of condition queues. Hence within a mon- 
itor for each condition that must hold before a 
process can continue, a queue is required. Pro- 
cesses wait on a queue until signalled by another 
process to continue. 


The user can declare these queues as follows:- 

   instance unitqueue : condition ; 

To suspend itself on a condition queue a process 
performs a wait operation, as in unitqueue.wait. 
To release a process from a queue another process 
performs a signal operation, as in unitqueue.signal, 
indicating to the signalled process that the reason 
it was delayed no longer holds. Thus processes and 
monitors are the basic structuring tools for 
programs involving parallelism. 
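
The wait and signal operations have close counterparts in many current languages. The following Python sketch is an analogy rather than a rendering of Pascal-Plus: threading.Condition plays the part of the monitor lock together with one condition queue, and the names UnitMonitor, await_request and request are our own. Python conditions use signal-and-continue semantics, so the waiting process re-tests its condition in a loop, which the monitors of [6] do not require.

```python
import threading


class UnitMonitor:
    """Shared data plus the procedures that manipulate it."""

    def __init__(self):
        self._unitqueue = threading.Condition()  # lock plus one condition queue
        self._pending = 0                        # shared data guarded by the lock

    def await_request(self):
        with self._unitqueue:                    # exclusive access to the data
            while self._pending == 0:
                self._unitqueue.wait()           # compare unitqueue.wait
            self._pending -= 1

    def request(self):
        with self._unitqueue:
            self._pending += 1
            self._unitqueue.notify()             # compare unitqueue.signal


def functional_unit(monitor, operations, log):
    """A process that waits to be requested, then performs its function."""
    for _ in range(operations):
        monitor.await_request()
        log.append("executed")
```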


Each functional unit of the Cray-1 was rep- 
resented as a process. All the functional units 
have a similar structure in that they oscillate 
between periods of activity and inactivity. When 
they are inactive they wait on a condition queue 
until requested by an instruction to perform their 
function. Because of the similarity in their 
structure an array of condition queues and pro- 
cesses was declared. 


The structure of a functional unit process 
is such that after creation it waits on a 
condition queue; when it is signalled by another 
process it enters an infinite loop. The loop con- 
sists of a period of activity followed by waiting 
on its condition queue again. 


One instance of this process for each of 
the functional units is declared; a parameter is 
used to distinguish between them. Each functional 
unit process is initiated whenever the inner state- 
ment of the block in which it is declared is 
encountered. The order of creation is the same 
as the order of declaration of the processes. 


A simulation monitor is included in Pascal- 
Plus in order to provide facilities which help 
with discrete event simulation. The main feature 
is an ordered queue known as the time queue on 
which processes suspend themselves for a period of 
pseudo-time using a procedure 'hold'. The queue 
is organised so that processes with early wake up 
times are at the front of the queue; the wake up 
time for a process is the sum of the current time 
plus the parameter of the 'hold' procedure. 


Thus for a functional unit process the 
period of activity is represented as a call to 
this monitor procedure, for example 
simulation.hold(holdunittime), where the parameter 
'holdunittime' represents the period of time for 
which this particular functional unit is meant to 
be executing. 


Simulation time only advances when all pro- 
cesses are suspended either on a condition queue 
or on the time queue. Only then is time advanced 
to the wake up time of the first process on the 
time queue. All processes waiting for this value 
are then reactivated. 

The simulation terminates when all the pro- 
cesses in the model are waiting on condition 
queues. 
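
The time queue mechanism can be sketched compactly. The Python below is our own illustration and not the Pascal-Plus simulation monitor: generators stand in for processes, a yielded value stands in for a call of 'hold', and the names Simulation, start and run are assumptions. Condition queues are omitted, so in this reduced form the run ends when the time queue is empty.

```python
import heapq
from itertools import count


class Simulation:
    """Pseudo-time control built around an ordered time queue."""

    def __init__(self):
        self.now = 0
        self._time_queue = []   # entries are (wake_up_time, sequence, process)
        self._sequence = count()

    def start(self, process):
        self._schedule(process, 0)

    def _schedule(self, process, delay):
        # Wake-up time is the current time plus the parameter of 'hold'.
        entry = (self.now + delay, next(self._sequence), process)
        heapq.heappush(self._time_queue, entry)

    def run(self):
        # Time advances only when every process is suspended; it then jumps
        # to the wake-up time of the process at the front of the queue.
        while self._time_queue:
            wake_up, _, process = heapq.heappop(self._time_queue)
            self.now = wake_up
            try:
                delay = next(process)   # run the process until its next hold
            except StopIteration:
                continue                # the process has terminated
            self._schedule(process, delay)


def functional_unit(sim, name, hold_unit_time, operations, trace):
    for _ in range(operations):
        yield hold_unit_time            # stands in for simulation.hold(...)
        trace.append((sim.now, name))   # the unit has finished one operation
```

With two units holding for 6 and 7 time units (values chosen only for illustration), the trace records completions in pseudo-time order.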


This section describes the structure of the 
model and some of the problems we encountered 
during the design phase. 


Our original plan was to model the comput- 
ation section of the Cray-1 as a series of 
dynamic and static resources; the former being the 
functional units and the latter being the memory 
and the various registers. In this way each 
dynamic resource could be represented by a process 
which would lie dormant on a queue until it is 
presented with operands and asked to perform its 
function. 


The static resources were to be assigned to 
various monitors in which they would be protected 
from the unpredictable effects of parallel pro- 
cesses. Whenever a functional unit required a 
particular register it would make a request to the 
appropriate monitor. If the request could not be 
satisfied the process or functional unit would 
have to wait on a condition queue until it became 
available. 


However two situations complicated this design 
decision and caused the model to be restructured:- 
first, a functional unit could become free before 
the registers that it was using, whereas our scheme 
implied that the acquiring and releasing were 
performed by the functional unit process; and 
second, the technique of chaining, where the result 
operand of one functional unit is fed to another, 
caused a similar type of problem about the 
releasing of registers. 


To surmount these difficulties the classific- 
ation of dynamic and static resources was changed. 
The registers were described by processes so that 
they could be released before the functional unit 
which was using them. 


The structure of these register processes 
was similar to that of the functional units, and 
they were defined and declared accordingly. 


The memory was also regarded as a functional 
unit and treated as a process for timing consid- 
erations. Our model did not take account of the 
complex timing mechanism relating to the issue of 
instructions whenever bank conflicts occur. 


We found the program and data structures of 
Pascal-Plus well suited for the representation 
and manipulation of parallel events. The major 
benefit of Pascal-Plus in comparison to a sequen- 
tial programming language is the ease with which 
concurrent operations can be specified and 
synchronised. 


SESSION 5: RESOURCE CONTROL AND ALLOCATION 
Abstract 
This paper describes hardwired resource 
allocators for TRAC-like reconfigurable 
architectures. These allocators facilitate 
searching for available resources in the system 
and allocation of a subset of these to a given 
request. Various algorithms can be implemented 
for the search and the allocation of the 
resources. Tree-structured allocators look 
particularly attractive, with the cost-delay 
product being of the order of M*(log M)^2 for a 
system with M resources of the same type. The 
paper also describes how this scheme can be 
extended to allocate multiple types of resources in 
the system. 


1.0 Introduction 


Conversion of software functions into 
hardwired modules looks attractive because of the 
promise of improved execution speed. Due to the 
recent advances in semiconductor technology the 
trade-offs involved in cost-speed functions have 
favored increased speed at a small increase in 
hardware cost. This trend in decreasing hardware 
costs has encouraged system designers to 
incorporate many of the software functions, 
especially those related to the operating systems, 
into hardware modules [8], [13]. For example, 
some architectures provided hardwired functions 
for manipulating capabilities [3]. The Symbol-2R 
[8] architecture had a hardwired supervisor to 
support a time-shared environment, and had 
features for direct execution of high-level 
languages. In some of the IBM 360 series machines 
table-look-up and address translation functions 
for paging systems were made faster using 
associative memories. CASSM [11] and similar 
architectures [6] proved that many of the 
conventional software functions in data-base 
applications could be easily transplanted into 
hardware to enhance the overall system 
performance. The PASM [9] architecture uses separate 
microcontrollers to control partitioned processor 
arrays in SIMD mode. These controllers can 
selectively mask some of the processing elements 
(PEs) in the array by using a mask vector which 
specifies the addresses of the PEs to be masked. 
More than one PE can be specified by using 
don't-care values in the address tuples. The 
scheme provides an intelligent mechanism to 
specify and decode the addresses once it is known 
which PEs are to be selected for execution. 


In this paper we show that the resource 
allocation for architectures such as TRAC [10] can 
be hardwired. The scheme presented in this paper 
is useful in deciding which of the available 
resources should be allocated to a given request 
for a partition. Such decisions are dependent on the 
algorithm implemented in the hardwired resource 
allocator. In fact, the kind of resource 
allocator we present here can be used in any 
architecture having a large number of identical 
modules as assignable units. If the system has 
more than one type of assignable modules (e.g. in 
addition to memory modules it may have disks, 
tape-drives, or printers), then this approach can 
be easily extended to cover such cases. But it is 
especially useful in allocating resources that 
need to be set up quickly, such as components of a 
switch or of shared memory. The structure 
presented here is a step towards decentralization 
of control. The hardware cost of this kind of 
resource allocator is of the order of M*log(M) 
when there are M assignable modules in the system. 
Tree-structured allocators are particularly 
attractive because delay is of the order of log M. 


We have used TRAC as a model to present our 
thesis that hardwired schedulers can be 
effectively used to implement a range of 
algorithms for resource allocation on 
reconfigurable machines. Performance 
(effectiveness) of scheduling algorithms for 
architectures based on networks, such as banyans, 
can be highly dependent on the mix of the jobs to 
be scheduled on the system. This paper neither 
proposes nor claims effectiveness of any 
scheduling algorithm for reconfigurable 
architectures like TRAC. This kind of study of 
scheduling algorithms for banyan network based 
architectures is being done elsewhere [2]. 


One of the basic philosophies in scheduling 
large, modular multi-processor architectures is 
that the software scheduler should maintain only a 
minimal amount of information on the global 
system-state. The scheduler should transmit 
parameters to the hardware which allocates the 
resources. The hardware should allocate resources 
to avoid blockage, faults, etc., and respond with 
a success or failure signal to the scheduler. If 
done with care, this philosophy can permit 
distributed control of the switch, which enables 
parts to be controlled independently, removes the 
centralized controller and, most importantly, 
removes communication paths (pins) between the 
central controller and the switch. Using the 
resource allocators presented here, we find that 
there is little need to maintain even the list of 
available resources in the system. However, 
maintaining such information can be very useful so 
the scheduler will not attempt to allocate 
resources when there are not enough currently 
available in the system. 


In this paper we present two strategies to 
search for available resources in the system, and 
two algorithms to select a subset of available 
resources. The selection algorithms, when used for 
a TRAC-like architecture, require additional 
hardware. In contrast, the search strategies can 
be implemented using the logic of the banyan 
switch as used in TRAC. The organization of the 
rest of the paper is as follows: the next section 
presents the problem description; section 3 
describes algorithms for search and selection; 
section 4 describes algorithms for search and 
selection of resources; section 5 presents the 
functional design of hardware structures and the 
required control logic; and finally section 6 
presents some ideas on multi-type resource 
allocators. 


2.0 Problem Description 


In the architectures which we will be 
primarily concerned with in this paper a set of 
resources - processors, memories, I/O devices - is 
connected by a switching network which is used to 
partition these resources into independent 
processing structures. Goke [4] showed that 
banyan networks are suitable for this purpose, and 
an architecture based on banyan networks was 
proposed in [5]. We will be using this 
architecture to demonstrate our ideas on hardwired 
resource allocation. In the following paragraph 
the logic for setting up such partitions is 
briefly described. We show that if a certain 
partition on the switch is requested, it is either 
granted and an acknowledge signal is returned to 
the scheduler, or a failure is signalled in case 
of a blockage. 


In [5] the partitions consist of data-trees 
and instruction trees. A data-tree connects a set 
of memory modules to a processor called the root 
of that data-tree. An instruction-tree connects a 
set of processors which are roots of data-trees to 
facilitate SIMD mode of operation. In [4] the 
basic logic of setting up such partitions is 
described; a more detailed design of such a 
switch, as used in TRAC, can be found in [7]. In 
TRAC a four-level banyan switch with spread=2 and 
fanout=3 is used; the switch has 81 base and 16 
apex nodes. The memory modules and the I/O 
devices are connected to the base nodes, and the 
processors are connected to the apex nodes. One 
of the base nodes can be used as a port through 
which the software scheduler feeds commands and 
addresses into the hardware resource allocator. 
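
The node counts quoted for TRAC follow from the switch parameters if one assumes a regular banyan in which the number of base nodes is the fanout raised to the number of levels and the number of apex nodes is the spread raised to the number of levels. That relation is our assumption for this Python sketch (the function name is ours as well); it agrees with the figures of 81 and 16 given above.

```python
def banyan_node_counts(levels, spread, fanout):
    """Base and apex node counts of a regular banyan (assumed relation:
    base = fanout ** levels, apex = spread ** levels)."""
    return {"base": fanout ** levels, "apex": spread ** levels}


if __name__ == "__main__":
    print(banyan_node_counts(levels=4, spread=2, fanout=3))
```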


In order to set up some partition on the 
switch, the scheduler has to first decide which of 
the available memory modules are to be chosen as 
candidates. A bus connects the port (feeding 
commands to the switch) to all the resources, so 
they can be addressed like I/O devices in a 
microcomputer. The selection procedure then 
selects and marks a subset of these memory modules 
sequentially using the control bus. Marking is 
done by addressing the resources, as I/O devices 
are addressed in a microcomputer, and setting a 
flip-flop. To connect a set of memory modules (as 
data-tree leaves) to a processor, the marked 
modules send a request signal up towards the 
processor nodes. This request signal, at every 
node in the switch, proceeds upward to all links 
immediately above it, through all the levels to 
the apex processors. If the request signal finds 
a busy or faulty node at any level, then from that 
node upwards a denial signal is sent. The denial 
signal, like the request signal, proceeds upward to 
all links connected immediately above it, through 
all levels to the processors. All those 
processors which do not receive a denial signal 
are candidates to be the root for the desired 
tree. There exist unblocked paths from these 
processors to each of the requested memory 
modules. One of these processors is selected as 
the root of the data-tree. Then from this 
processor a grant signal is sent towards the 
memory nodes. This grant signal proceeds toward 
the memory, going out on each link below each node 
that it gets to. Any link getting both a request 
and a grant signal is part of the tree. Such a 
link switches itself to a conducting state to form 
paths between the processor and the requested 
memory modules. The set of links, though 
physically a tree, acts like a wire-OR or a 
tristate bus connecting the resources. 


Because of the blocking nature of the switch 
it is possible that none of the processors can be 
connected to all the requested memory modules. 
Such a partition is said to be blocked and a 
negative acknowledge is signalled to the 
scheduler. Generally, a process requires a set of 
memories and I/O devices. Some resources have 
data in them required by the process. These are 
care resources, but other resources need only be 
selected from a pool of similar resources, such as 
empty memory modules. If the request for some or 
all of the memory modules is of don't-care type 
(i.e. any set of memory modules can be selected 
from the available modules), then the request for 
setting up the data-tree is retried with another 
set of modules. Testing all don't-cares to 
generate sets of specific resources to form a 
data-tree is an NP-complete problem, and a serious 
limitation in the scheduling mechanism used in 
TRAC. In this paper we study how this selection 
can be done by hardwired allocator algorithms. 
These algorithms are defined in the next section. 
Our interest in the resource allocation problem 
was mainly prompted by its existence in 
the TRAC system. 




Resource allocation has three phases: search 
for qualified resources, selection and validation 
of a subset for allocation, and finally granting 
of resources to the request. It is important here 
to understand what we mean by the term "qualified 
resources". The set of qualified resources is a 
subset of the available resources, and various 
criteria can be used to designate an available 
resource as qualified. These criteria determine 
what strategy must be used to find the qualified 
resources in the system. The second phase is to 
select a subset of the qualified resources and 
validate that the desired partition can be 
set up. In the last phase, on successful 
validation, these resources are marked unavailable 
for use in further tree allocation and the 
partition is set up. The search and selection 
algorithms might make use of some properties of 
the interconnection network. In case the selected 
set of modules cannot be connected, the selection 
phase retries with another subset of the qualified 
modules. 


3.1 Search Strategies for Qualified Resources 


In this section we describe two strategies to 
search for qualified resources in the system. 


Strategy 1: The simplest way to define qualified 
resources is to designate every available resource 
as qualified. Therefore, the selection algorithm 
considers all the available resources when 
selecting a subset for validation. A tag bit can 
be associated with each module to indicate whether 
it is available or busy; when the resource is busy 
this bit is reset, otherwise it is set. Failed 
modules can be excluded from the set of qualified 
resources by resetting this bit. Using a tag bit 
on each module to indicate its availability, there 
is no need to have a search phase, because all the 
available modules are tagged, which also signifies 
that they are the qualified modules. 


Strategy 2: This strategy shows how qualified 
resources can be defined on the basis of some 
properties of the interconnection network. In the 
proposed strategy, a signal is sent down from an 
unused processor towards the memories. The signal 
propagates downward only if it does not encounter 
any busy node on its path. All the base nodes 
which receive the signal are designated qualified, 
and are connectible to this processor. If the 
number of qualified modules is greater than or 
equal to the number requested for the data-tree, 
then the allocator algorithm allocates a subset of 
these and requests the data-tree formation. 


If the number of qualified modules is less 
than the desired number, then this procedure is 
repeated with the next available processor. If 
all the available processors have been tried without 
success, the requested data-tree cannot be 
loaded in the current state of the switch. 
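
Strategy 2 amounts to a downward search from each free processor through nodes that are not busy. The Python sketch below illustrates this on an abstract description of the switch; the dictionary of downward links, the node names and both function names are our assumptions, and busy state is attached to nodes only, which simplifies the link-level logic of the real switch.

```python
from collections import deque


def qualified_base_nodes(down_links, busy, apex, base_nodes):
    """Base nodes reached by a signal sent down from `apex` that is
    stopped by any busy node on its path."""
    if apex in busy:
        return set()
    reached = {apex}
    frontier = deque([apex])
    while frontier:
        node = frontier.popleft()
        for child in down_links.get(node, ()):
            if child not in busy and child not in reached:
                reached.add(child)
                frontier.append(child)
    return reached & set(base_nodes)


def search_strategy_2(down_links, busy, free_processors, base_nodes, rr):
    """Try each unused processor in turn; return (processor, qualified set)
    for the first one with at least `rr` qualified modules, else None."""
    for apex in free_processors:
        qualified = qualified_base_nodes(down_links, busy, apex, base_nodes)
        if len(qualified) >= rr:
            return apex, qualified
    return None
```

A result of None corresponds to the case in which every available processor has been tried and the data-tree cannot be loaded.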


The advantage of these strategies is that 
there is no need to maintain a list of available 
resources with the scheduler. Whenever needed, 
the scheduler can generate this information in a 
few memory-cycles. 


3.2 Resource Selection Algorithms 


We will be considering two selection 
algorithms: first-fit and group-fit. 


First-fit Algorithm: In this algorithm all the 
qualified modules are serially assigned a number; 
the order in which this number is assigned can be 
a function of the network topology. In banyan 
networks this numbering can be based on the 
addressing scheme [12] which implicitly captures 
the concept of the distance function [4]. For a 
data-tree request of "rr" modules the scheduler 
allocates the first "rr" qualified 
modules. The switch controller then attempts to 
form the data-tree; in case of a failure the first 
module in the selected set is replaced by the 
(rr+1)'th qualified module. The data-tree 
formation is retried with this new set of modules. 
This process is continued either for a fixed 
number of attempts, or until the request with the 
last available module is tried. This can be 
viewed as a moving selection window of width "rr", 
which is moved one step from left to right for 
every attempt until the requested partition can be 
set up. 
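
The moving window can be written down in a few lines. In the Python sketch below the list `qualified` is assumed to be already in serial-number order, and `can_connect` is a hypothetical stand-in for the attempt by the switch controller to form the data-tree; neither name comes from the TRAC design.

```python
def first_fit(qualified, rr, can_connect, max_attempts=None):
    """Slide a window of width rr over the serially numbered qualified
    modules until a window can be connected.

    Returns the selected modules, or None if every attempt fails."""
    windows = len(qualified) - rr + 1      # number of window positions
    if rr <= 0 or windows <= 0:
        return None
    if max_attempts is not None:
        windows = min(windows, max_attempts)
    for start in range(windows):
        candidate = qualified[start:start + rr]
        if can_connect(candidate):         # validation by the switch
            return candidate
    return None
```

Dropping the first module and appending the (rr+1)'th is the same as advancing `start` by one, which is why a slice suffices here.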


In the design of the resource allocators 
presented here, we show how this selection method 
can be used with the search strategies 1 and 2. 
This selection algorithm, when used for TRAC-like 
architectures along with the search strategy 2, 
does not need retries because any partition with 
the qualified resources is guaranteed to be 
unblocked. 


Group-fit algorithm: In the systems which have 
resources organized in groups on the basis of 
physical proximity, it may be desirable to select 
all the resources for a request from the same 
group. These groups may possibly be divided into 
smaller subgroups. For example, in TRAC which has 
a fan-out of 3, the smallest group size for the 
base nodes is 3, and it increases as powers of 3, 
e.g. next larger sizes of groups are 9, 27, and 
81. A group of size 9 contains three subgroups of 
size 3; similarly a group of size 27 contains 
three subgroups of size 9. Nodes belonging to the 
same group are "closer" (in terms of the number of 
links in the smallest sub-tree which can connect 
them in the banyan network) to one another as 
compared to any node in a different group. 


A group-fit algorithm is encoded in a 
simulator for TRAC [2]. For a data-tree request 
it tries those memory modules, from the list of 
available modules, which belong to the same group. 
(Note that in TRAC the resource modules to be 
assigned by the allocator are connected to base 
nodes of the banyan network.) The number of such 
modules is f^l, where l is the number of levels in 
the switch. The algorithm is defined below for a 
banyan network with spread=s and fanout=f. The 
algorithm presented below is for the search 
Strategy 2. In this case the setting up of the 
data-tree is guaranteed if the required number (or 
more) of resources qualify during the search 
phase. 
{rr = number of resources requested} 
set m such that f^m <= rr < f^(m+1) 
if rr <= number of available resources then 
begin 
   for ns = m to l do 
   begin 
      if rr modules are available from 
         base node addresses 
         in the range (1+(i-1)*f^ns)..i*f^ns 
         (for some group i) 
      then allocate rr modules and request 
         data-tree formation 
   end 
end; 


If the number of base-nodes is M and the number of 
processors is P, then the complexity of the above 
algorithm is O(P*log M) for Strategy 2. 
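
The group-fit search can be sketched as follows for a single set of qualified base-node addresses. This Python version is our reading of the algorithm and makes several assumptions: addresses run from 1 to f^l, group i of size f^ns covers addresses (1+(i-1)*f^ns)..i*f^ns, the groups at each size are scanned in address order, and the first rr qualified modules of the first sufficiently full group are taken. The function name and the scan order are ours.

```python
def group_fit(available, rr, fanout, levels):
    """Pick rr modules that all lie in the smallest possible group.

    available: set of 1-based base-node addresses that are qualified.
    Returns a sorted list of rr addresses, or None if rr cannot be met."""
    if rr < 1 or rr > len(available):
        return None
    m = 0
    while fanout ** (m + 1) <= rr:         # fanout**m <= rr < fanout**(m+1)
        m += 1
    for ns in range(m, levels + 1):        # try progressively larger groups
        size = fanout ** ns
        for i in range(1, fanout ** (levels - ns) + 1):
            low, high = 1 + (i - 1) * size, i * size
            in_group = sorted(a for a in available if low <= a <= high)
            if len(in_group) >= rr:
                return in_group[:rr]
    return None
```

For a two-level switch with fanout 3 (nine base nodes), a request for three modules is met from one group of size 3 when such a group is fully qualified, and otherwise from the enclosing group of size 9.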

4.0 Functional Design of the Resource Allocator 


The designs which we propose here support the 
search and selection phases of resource 
allocation. Validation and granting would 
normally be supported by the control logic for 
partitioning and reconfiguration on the 
interconnection structure. The functional design 
of these hardware allocators is presented here. 


When designing hardware, a tradeoff in cost 
and speed is always encountered. To reduce the 
cost of a design, it is desirable to take 
advantage of low IC replication costs. This 
requires a design constraint on the number of IC 
pins. To gain speed, on the other hand, structures 
with minimal delay are required. In the hardwired 
resource allocator designs, this tradeoff is 
guided by the operating environment. 


We propose two types of allocator designs: 
the first is the Tree Structured Allocator, and the 
second is the Linear Structured Allocator. The 
first allocator optimizes on speed and provides 
two types of algorithms. One is the First-Fit 
algorithm, where the resources are allocated on 
the first available basis. The second is the 
Group-Fit algorithm; here we assume that the 
system network structure divides the resources 
into groups. This grouping can be done 
intentionally, or can be a side effect of the 
interconnection network. The Linear Structured 
Allocator is aimed at optimizing hardware cost. 
The delay of the allocator is proportional to the 
number of qualified resource modules. Only the 
First-fit algorithm can be supported by this kind 
of allocator design. 


For the discussion of the allocators given 
below, we assume that the resources are of the 
same type. Extending the algorithm to handle 
multiple types of resources is not difficult, and 
hardware solutions for this are given in a later 
section of the paper. 


4.1 Tree Structured Allocators 


To achieve high speeds in a cost-effective way, we 
chose a tree interconnection structure for the 
resource allocator. This interconnection 
structure is separate from the interconnection 
structure used for reconfiguration and 
multiprocessing. The resources are attached at 
the leaves, or at the leaf and internal nodes of 
this tree structure. The tree structure is used 
to calculate the number of qualified resources to 
the left of each node/leaf and also the total 
number of qualified resources in the system. This 
structure also assigns a serial number to the 
qualified resources. This is done in time 
proportional to log(M), where "M" is the total 
number of the resource modules connected to the 
tree. 


Each node of the tree looks like an adder 
[1]. If we assume a binary tree, then each node 
has 3 pairs of directed links incident on it 
(Fig. 1). Each pair has its left link going down, 
while the right link goes up. If a number X is 
placed on the link PD, and numbers Y and Z are 
placed on the links LU and RU respectively, the 
resulting sum of these numbers is shown at the end 
of links PU, LD and RD in Fig. 1. The PU link of 
the root node outputs the total number of 
qualified resources in the system. 


Figure 1 Functional view of the Tree-adder node 


To explain the working of our algorithm 
we assume that the resources are placed 
arbitrarily as the leaf nodes. The algorithm 
requires each resource to indicate if it is 
qualified. This is done by appropriately setting 
a flip-flop QR in each qualified resource. Then 
QR true indicates the resource is qualified, while 
QR false indicates it is busy. It is also 
necessary to store the value "rr", the number of 
resources requested by the task. A suitable bus 
will be assumed for this transfer to take place 
from the controller to the resource modules. 


Since only the qualified resources should be 
considered for allocation, we need to sum only the 
values of all QRs. Thus the numbers Y and Z (in 
Fig. 1) are the QR values of the respective 
resources connected at those locations, while X is 
set to zero to indicate that there exist no 
resources to the left of the leftmost leaf. This 
is best explained by studying Fig. 2. Here a 
system of 8 resources 
is connected as explained above. Each directed 
pair of links of Fig. 1 is shown as a single link 
between nodes, while the arrows indicate the 
directed links. The value at their head gives the 
respective sum at that link. 


Figure 2 Tree resource allocator example 
(the figure shows the controller, the resource 
select line and the resource modules, with busy 
and qualified resources marked differently) 


The total number of qualified resources in
Fig. 2 is obtained at the root and is sent to the
controller. If "rr" for a task is less than or
equal to this total then the controller knows the
task can be scheduled. The controller then
asserts the resource select line to assign the
resources to the processor. Each resource that is
qualified and has a sum less than "rr" on its LD
(or RD) link automatically selects itself. The
sum on the LD (or RD) link would be the serial
number assigned to the resource. After the
selection, the next phase of validation might be
necessary if search strategy-1 is being used.
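
The up-and-down summation described above amounts to a total count at the root and an exclusive prefix count at each leaf. The following Python sketch is a software model of that behavior, not the hardware design. The function names `node` and `allocate`, the list representation of the QR flip-flops, and the requirement that the number of leaves be a power of two are assumptions made for illustration.

```python
def node(pd, lu, ru):
    """Model of one internal node: returns (PU, LD, RD) for inputs PD, LU, RU."""
    pu = lu + ru   # sum of the subtree, sent up to the parent
    ld = pd        # nothing extra lies to the left of the left son
    rd = pd + lu   # the right son also sees the left son's count
    return pu, ld, rd


def allocate(qr, rr, x=0):
    """Return (total, selected) for QR values `qr` and request size `rr`.

    `selected` maps a leaf index to its serial number; it is empty
    when fewer than `rr` resources are qualified.
    """
    m = len(qr)
    assert m > 0 and m & (m - 1) == 0, "sketch assumes a power-of-two leaf count"

    # Upward pass: each level holds the PU values of the level below.
    levels = [[int(bool(q)) for q in qr]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([node(0, below[i], below[i + 1])[0]
                       for i in range(0, len(below), 2)])
    total = levels[-1][0]
    if rr > total:
        return total, {}

    # Downward pass: X enters the root's PD link; LD/RD values reach the leaves.
    down = [x]
    for below in reversed(levels[:-1]):
        nxt = []
        for i, pd in enumerate(down):
            _, ld, rd = node(pd, below[2 * i], below[2 * i + 1])
            nxt.extend([ld, rd])
        down = nxt

    selected = {i: down[i] for i in range(m) if qr[i] and down[i] < rr}
    return total, selected
```

With QR values [1, 0, 1, 1, 0, 1, 0, 1] and rr = 3, this model reports 5 qualified resources and selects leaves 0, 2 and 3 with serial numbers 0, 1 and 2.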


To implement the moving window of search
strategy-1, to shift the window to the right we
decrement the value X being fed into the PD line
(Fig. 1). The controller does this when
required, and stops this process on the last
qualified module being tried in the window.


The example given above presents one type of
architecture for tree structured resource
allocators. We give below three different types
of tree-structured architectures, two for the
First-fit algorithm, and one for the Group-fit
algorithm. These are implemented assuming search 
strategy-2 is used for selecting the resource 
modules. The cost of such resource allocators is 
of the order of M*log(M) when there are M resource 
modules. The linear term specifies the cost of 
the tree nodes, while the log(M) term gives the 
cost of the links in the tree structure. 


4.1.2 First-fit Algorithm -- Binary Tree With
Resources At The Leaves

The resulting tree
structure obtained on connecting the resources is
shown in Fig. 2. Each leaf node is the resource
module itself. A block diagram of the internal
nodes (including the root) is given in Fig. 3.


The full adders of a node can be
SN74283 [14]. A separate bus to pass "rr" from
the controller to the resources is not required.
Instead we feed "rr" into the root node's PD link
and then set the B port to 0 to get the F=A
function. Looking at Fig. 3, we see our tree
would be functionally equivalent to a bus.

Figure 3 An internal node of a binary tree
allocator with the resources
attached to the leaves

The width of each link is $w = \lceil \log_2 M \rceil$ and
the number of wires per node required is 6w. This
is a large number if the node is to be implemented
as an IC module. But we notice that the number of
pins can be reduced to 6, if we serialize the data
in the links of the tree. This would be at the
cost of a reduction in the execution speed of the
algorithm. But this cost may be overshadowed by
the cost of implementing 2 to 3 nodes on the same
IC module.


Assuming a parallel data transmission tree 
structure, the following execution times are 
obtained. Here ND is the delay imposed by the 
node, while "M" is the total number of resources 
connected to the tree. The time taken to compute 
the total number of resources is "Ta", while the 
time taken to serialize is "Ts". 


$T_a = \lceil \log_2 M \rceil \cdot ND$

$T_s = (2 \lceil \log_2 M \rceil - 1) \cdot ND$ for $M \ge 2$


This architecture can be used when minimal 
design changes to the existing resource module are 
wanted, or when the resource modules are placed 
far apart. Here the consecutive leaf nodes can be
the resources that are physically close to each
other. This would reduce the length of the link
(wire) connecting them to the tree.




resource module itself. This way each leaf and
internal node, including the root of the resulting
tree, is a resource module. This approach reduces
the number of nodes and levels in the resulting
tree. The node design will have to be modified as
shown in Fig. 4.

Figure 4 A node of a binary tree structured
allocator with the resources
attached to all nodes


As in the design described above, we can use
SN74283 ICs. This structure has the advantage
that no external tree need be implemented. The
resource modules would have to be linked to their
neighbors in the appropriate manner. Resources
are connected to the tree as shown in Fig. 5. The
node numbers indicate their physical location,
with node 1 assumed to be the leftmost node. The
node allocation basically follows an in-order tree
traversal path. This arrangement reduces the
length of the links (wires) in the tree structure,
which is necessary to reduce the cross talk and
the cost of the tree.


The cost can be further reduced by
simplifying the leaf nodes to contain only the
serial register MR, the QR flip-flop and the
resource control hardware (as in Fig. 2). The
number of pins required on each node IC for a
totally parallel data transmission tree is
$6w = 6 \lceil \log_2 M \rceil$.

Figure 5 Assigning resources to nodes of
the tree structured allocator.


The execution time for this tree structure is
as follows:

$T_a = (\lceil \log_2 (M + 1) \rceil - 1) \cdot ND$

$T_s = (2 \lceil \log_2 (M + 1) \rceil - 3) \cdot ND$ for $M \ge 2$

This architecture can be used when the
resource modules are implemented as ICs, or are
placed physically close to each other. The node
can then be placed within the IC, and using serial
data transmission the pins can be reduced. Even
if an IC implementation is not required, the node
design given above is cheaper in terms of
hardware, if the leaf nodes are simplified as
explained above.


4.1.3 Group-fit Algorithm

In networks which
have their resources organized in groups, it may
be advantageous to select the resources from a
single group. If we can allocate all the required
number of resources from the same group, then we
can save on the number of links allocated. This
saving is in terms of reducing the possibility of
blocking other available resources in the
following allocations [4].


An example of this design for the TRAC
architecture [10] is presented here. Other
applications can use this by modifying the node
design as required. We use a ternary tree, with
the resources attached to the leaves. This allows
for a simple node design as shown in Fig. 6. A
request for partition is sent to the root of the
tree, from which it moves towards the leaf nodes
as described below. A node selects a particular
son if that subtree has the required number (or
more) of qualified resources, and none of its
elder brothers (if any) have the required number
of qualified resources in their subtrees. If none
of the sons satisfy the above condition, then the
first "rr" resources in that subtree are selected.
To correctly identify the active subtree an extra
line is required in the node. This is set if the
parent of a node has selected this branch of the
tree (i.e. line PA).

Figure 6 A node of the group selection tree
structured allocator


Qualified resources transmit their QR values.
The resulting sum at the root and the serial
numbers are calculated. At the root cell a
comparator checks the sum, and the PA line is asserted
if there are "rr" or more qualified
resources. A similar check is done at all nodes
and the respective control lines are asserted (see
Fig. 6). If a node lies in an inactive subtree
then the LU, MU and RU links are pulled down to 0.
This can be done by designing them to be wire-OR
lines. At any node in the tree, the respective
lines are then made 0 if their compare function is
0. This extra hardware (not shown in Fig. 6)
is needed to correctly serialize the resources
that are in the active (selected) subtree. The
resources allocated are those which are qualified,
have their PA set, and have serial numbers less
than "rr".
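
The group selection rule can be modeled in software as a descent from the root towards the eldest son that can satisfy the whole request. The Python sketch below is only an illustration of that rule under stated assumptions: the tree is given as nested lists of QR values (1 for qualified, 0 for busy), leaves are numbered left to right, and the names `count`, `flatten` and `group_fit` are hypothetical. It does not model the PA lines or the wire-OR links.

```python
def count(tree):
    """Number of qualified leaves (QR = 1) in a subtree."""
    if isinstance(tree, list):
        return sum(count(son) for son in tree)
    return int(bool(tree))


def flatten(tree):
    """QR values of the leaves of a subtree, left to right."""
    if isinstance(tree, list):
        return [q for son in tree for q in flatten(son)]
    return [int(bool(tree))]


def group_fit(tree, rr):
    """Indices of the leaves selected for a request of `rr` resources."""
    if rr <= 0 or count(tree) < rr:
        return []          # the root comparator rejects the request
    active, offset = tree, 0
    while isinstance(active, list):
        pos = offset
        for son in active:                 # eldest son first
            if count(son) >= rr:
                active, offset = son, pos  # this son becomes the active subtree
                break
            pos += len(flatten(son))
        else:
            break                          # no single son can satisfy the request
    qualified = [offset + i for i, q in enumerate(flatten(active)) if q]
    return qualified[:rr]                  # the first rr qualified resources
```

For the tree [[1, 0, 1], [1, 1, 1], [0, 1, 1]], a request for 3 resources is served entirely from the middle group (leaves 3, 4 and 5), while a request for 4 cannot be met by any one group and falls back to the first 4 qualified leaves of the whole tree.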


The tree will have the following worst case
timing:

$T_a = \lceil \log_3 M \rceil \cdot ND$

$T_s = ND$ for $1 < M \le 3$

$T_s = (3 \lceil \log_3 M \rceil - 1) \cdot ND$ for $M > 3$


4.2 Linear Structured Allocator


First-fit Algorithm

This architecture does
not require the tree structure; instead it uses a
counter in each resource module (see Fig. 7).

Each resource module requires a
register (MR) to hold the value "rr". A single
line bus (RESREQ) from the controller is used to
serially transmit this value to all resource
modules. Another line, INC, to increment the
counter is also required from the controller to
all modules. Each module constantly compares the
counter output with "rr", and sets the DONE line
as soon as an equality match is obtained.


The resource modules are qualified to 
participate in the present allocation run, as done 
in the other designs. The QR flip-flop is 
appropriately set, all counters cleared, and the 
MR register initialized. The controller then
sends pulses on the INC line, and all enabled
counters are incremented with each pulse. A
counter is enabled if QR is set and the resource 
has not set its right neighbor link (see Fig. 7). 
The counter is enabled as long as the left 
neighbor link is not set. On this being set the 
module increments the counter for the last time, 
and sets its right link. After this for all 
following pulses the counter is not incremented. 
If an intermediate module is not qualified, then 
its left link is internally joined to the right 
link, and the counter cleared and disabled. 


The incrementing process is stopped as soon
as the DONE line is set. It is set by a qualified
module, when its counter reaches the value "rr".
All resource modules with counter values less than
or equal to "rr" select themselves. The counter
value for each selected resource is its serial
number. If the number of qualified resources is
less than "rr" then the RLINK line is set before
the DONE line. At this point the controller would
select the next processor (if any), and do the
above.


This architecture has the advantage of having
a low cost, since only a few lines and minimal logic
are required. It further lends itself to being
implemented within the resource module itself. A
significant loss in speed will be felt only when
the "rr" for the allocation is large. This is
because the allocation is sequential and has the
time complexity of O(rr).
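
The pulse-by-pulse behavior of the linear allocator can be followed in a small simulation. The Python sketch below is one reading of the description above, not a definitive model: it assumes that the leftmost module sees its left link set from the start, that unqualified modules pass the link through without delay, and that a module counts as selected only once it has set its own right link. The name `linear_allocate` and the returned tuple are hypothetical.

```python
def linear_allocate(qr, rr):
    """Simulate INC pulses; return (selected, pulses).

    `selected` maps a module index to its serial number (its final
    counter value), or is None when the link runs off the right end
    (RLINK) before DONE is reached.
    """
    m = len(qr)
    counters = [0] * m
    right_link_set = [False] * m
    pulses = 0
    pos = 0                      # module that currently sees its left link set

    while pulses < rr:
        # Unqualified modules join their left link to their right link.
        while pos < m and not qr[pos]:
            pos += 1
        if pos == m:             # RLINK set before DONE: too few resources
            return None, pulses
        pulses += 1
        for i in range(m):       # every enabled counter sees the INC pulse
            if qr[i] and not right_link_set[i]:
                counters[i] += 1
        right_link_set[pos] = True   # last increment, then pass the link on
        pos += 1

    selected = {i: counters[i] for i in range(m) if right_link_set[i]}
    return selected, pulses
```

For QR values [1, 0, 1, 1, 0, 1] and rr = 3, three pulses are issued and modules 0, 2 and 3 receive serial numbers 1, 2 and 3, which matches the O(rr) behavior noted above.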


5.0 Ideas on Multi-type Resource Allocators 


The above described allocator algorithms
considered resources of the same type. We can
extend them to handle multi-type resources, as
described here. The actual solution chosen would
depend on the cost-speed tradeoffs of the system.


5.1 Tree Structured Allocators 


There are two types of multi-type resource 
allocators -- Multi-pass and the Single-pass. In 
the Multi-pass allocators all the resources of the 
system are connected to the hardwired allocator as 
described in the single resource type case. Only 
the resources of the same type are qualified 
during a pass. If a task required 3 types of 
resources to be allocated, then 3 passes would be 
required. 


The Single-pass type allocator allows us to
maintain the time complexity of the algorithms
described earlier. Here "M" would mean the
cardinality of the largest resource type. On the
other hand it increases the cost of the hardware,
since the width of each link is increased.


Each link for the above defined algorithms
had a width of $w = \lceil \log_2 M \rceil$. To handle
multi-type resources in a single pass we have to
increase the width to

$w = w_1 + w_2 + \cdots + w_i + \cdots + w_n$

where "w_i" is the width for resource type "i", and
we have "n" different resource types. Each link
is logically divided into "n" fields (Fig. 8), but
would be considered to be a positive integer
number by the tree adder. If a resource module is
qualified it increments only its logical field value. No
overflows between adjacent fields take place,
since the field width for each resource type is
$w_i = \lceil \log_2 M \rceil$. The serialization and the
summation process of the tree adder would not be
changed. But the QR flip-flop connection to the
tree and the allocation procedure
are slightly altered.

Figure 8 Link field widths for the resource
types of the allocator


The QR flip-flop output for a resource type
is attached to the least significant bit of the
respective logical field. The MR register
in each resource module is likewise divided into the above
logical fields. For a resource module only its
field is made active and the other fields made
inactive. This is because during the comparison
for allocating a resource, this active field in
the MR register is used. It would compare itself
with the resulting serialized number obtained at
the resource node. The resource would be
allocated as done before. The controller allows
allocation only if all resource requests are
satisfied.
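
The idea of carrying several per-type counts in one wide link can be checked with ordinary integer arithmetic. The Python sketch below is an illustration under stated assumptions: the helper names `field_layout`, `qr_word` and `unpack` are hypothetical, field 0 is placed at the least significant end, and each field is sized with `bit_length()` so that the full count of a type always fits (this can be one bit wider than the width quoted in the text when the count is a power of two).

```python
def field_layout(max_counts):
    """Return (widths, shifts) with one logical field per resource type."""
    widths = [max(1, m.bit_length()) for m in max_counts]
    shifts, position = [], 0
    for w in widths:
        shifts.append(position)   # bit position of the field's least significant bit
        position += w
    return widths, shifts


def qr_word(type_index, shifts):
    """Word contributed by one qualified resource: a 1 in the LSB of its field."""
    return 1 << shifts[type_index]


def unpack(word, widths, shifts):
    """Split a summed link value back into per-type counts."""
    return [(word >> s) & ((1 << w) - 1) for w, s in zip(widths, shifts)]
```

Adding the words of seven qualified resources of types 0, 0, 1, 2, 2, 2 and 0 with the plain integer `+` that the tree adder would apply, and then unpacking, gives the counts 3, 1 and 3 for the three types, with no carry crossing a field boundary.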


5.2 Linear Structured Allocators 


The Multi-pass and Single-pass algorithms can
be implemented for this allocator algorithm too.
In the Multi-pass allocator all the resources
would be connected together, and only one type of
resource would be qualified per pass. For
the Single-pass allocator we would have a separate
controller for each resource type. Resources of
the same type are connected to their controller as
described earlier. The algorithm would execute as
before, the only difference being that the system
scheduler is interfaced to the controllers of
each resource type.

6.0 Conclusions 


In this paper we have shown how resource
allocation functions for reconfigurable,
multiprocessing architectures can be delegated to
hardwired structures. Such hardwired resource
allocators look very attractive for large systems
because they relieve the scheduler from the burden
of maintaining lists of available resources. Even
the resource selection function, which would
normally be done serially by the scheduler, can be
done in parallel using the tree-structured
allocators presented here. This would reduce the
overhead of serial communication from a central
scheduler to the resource modules during the
selection phase. The tree-structured schedulers
look particularly attractive because of the cost
and delay both being of the order of log(M), where
the system has "M" assignable units.
Secondly, the tree-structured schedulers have an
inherent capability to capture some of the
topological properties of certain interconnection
structures, such as banyans, and thereby provide
a convenient mechanism to implement a set of
intelligent resource allocation algorithms.

Abstract -- An approach to the synchronization
and scheduling of resources in a demand-driven
data-flow model is outlined. It is shown that
demand evaluation provides a natural model for
resource usage and yields elegant solutions to
certain problems, such as the avoidance of busy
waiting and resource scheduling. A graph-based
applicative language (FGL), with data-flow
semantics, is first introduced for explaining our
primitives for resource control. A textual
version of FGL is later used for presenting
examples. The benefits of an applicative
language in aiding well-structured design, and
the clarity of data-flow models in making
indeterminate behavior explicit, are also
illustrated.


INTRODUCTION 


Data-flow models are well-known representations
for achieving asynchronous and concurrent
execution of applicative programs ([11], [4]).
The term 'data-flow' has, in the past, been
synonymous with 'data-driven', since the
execution of any operator in a data-flow program
is initiated by the availability of its input
data. Recently, a demand-driven execution model
for a data-flow program has been proposed as the
computational basis for an Applicative
Multiprocessing System [14]. Demand-driven
execution is based on the principle that the
execution of any operator is initiated by a
demand for its result, rather than by the
availability of data. In comparison with
data-driven models, the main advantages of a
demand-driven model are in avoiding unnecessary
computations and allowing conceptually infinite
data-structures to be constructed, without
requiring them to be manifest all at once.


The work presented here is motivated by the need 
to introduce the concept of a resource, and 
techniques for their control, in a demand-driven 
data-flow model. The emphasis is on a reliable 
and modular approach to designing many of the 
synchronization and scheduling functions of an 
operating system. 


This research is supported by the National
Science Foundation under Grant MCS-77-09369 A01.

A demand-driven model of execution has led us to
develop an extension for resource usage that is
also "demand-driven." It is worth noting that a
model for resource usage based explicitly on
demands is not unrealistic since user programs
may be viewed as making demands on the underlying
operating system to allocate/access resources.


Some aspects of resources that are ostensibly
alien to a demand-driven data-flow model are the
following:


1. In purely applicative models, there has 
heretofore been no notion equivalent to a 
reference to a data object, since all data 
objects are values and all computations are 
value-oriented. As a consequence, there is 
no notion of updating a data object. 
Instead, "modified" data objects are 
essentially new data values. A resource, on 
the other hand, requires some notion of a 
reference to it in order to be shared in a 
concurrent environment, and is updated for the
sake of efficient storage utilization.


2. Pure data-flow programs are determinate, 
since their output is determined solely from 
their input data values, regardless of the 
timing of operations. On the other hand, 
due to the unpredictability in the timing of 
operations and the updatable nature of 
resources, the behavior of a resource could 
be indeterminate. Since any legal
interconnection of pure data-flow programs
can only result in determinate
programs [12], it is necessary to explicitly
introduce operators for expressing
indeterminate behavior.


The main advantages of a demand-driven model for
resource control are the following:


1. Since waiting is fundamental to demand
evaluation, the avoidance of busy waiting is
accomplished without a need for explicit
protocols for "putting to sleep" and "waking
up" a task. In fact, no additional
mechanism is necessary for creating and
destroying tasks.

It is possible to eliminate the use of
"bracketing" operations around a critical
operation, e.g. the operations startread
and endread around a read operation, as in
Monitors [9]. These bracketing operations
are effectively substituted by the demand on
a resource to perform an operation and the
return of a result by the resource.


2. Unlike conventional models, where 
indeterminacy is caused by concurrent 
operations updating some shared global data, 
in a demand-driven data-flow model,
indeterminacy is a local effect that must be 
explicitly introduced, and is associated 
with the time-dependent arrival of demands 
or data at some operator. This has the 
advantage of being able to easily identify 
indeterminate behavior as occurring at 
well-defined points of the program. 


3. Since any indeterminacy must be explicitly
introduced, the arbitration and scheduling
of operations becomes explicit and also
user-programmable.


AN APPROACH TO RESOURCE CONTROL 


We summarize the salient features of our approach
to resource control. A logical resource consists
of two main components:
- the actual resource and associated operations
on it;
- the specification of its synchronization and
scheduling.
In the interest of modularity, these components
can be independently defined. At this stage, we
wish to treat the actual resource as an abstract
object whose structure and representation are not
of critical interest. Hence, we will omit
detailed definitions of the access operations.
The issues of access rights, i.e. the protection
of resources from unauthorized or improper
access, have not been considered in this
presentation. However, owing to the modularity
of our approach, such constraints can be
specified separately. The focus for the rest of
this presentation will be on synchronization and
scheduling.


Our solution to the problem of specifying
synchronization is to indeterminately order
concurrent accesses by using queue primitives.
It is perhaps worth noting that almost all 
synchronization schemes proposed in the 
literature rely on some mechanism for queueing. 
We feel the concept is basic to synchronization, 
and hence have introduced it explicitly as a 
primitive. Thus, multiple queues may be defined, 
and different types of accesses can be allocated 


different queues, the allocation policy being 
under the control of the programmer. The 
availability of multiple queues, each 


independently accessible, also overcomes the 
problem of a single input queue bottleneck. 


Scheduling consists of selecting some order for
serving these queues. Hence, primitive operators
are provided for waiting and removing requests
from queues. In a model where demands are the
only means of initiating the evaluation of any
operation, the evaluation of the actual operation
to be performed on the resource may be viewed as
being initiated by two demands: the first is the
user's demand to access the resource; the second
is the demand from the resource scheduler to
start evaluation. Thus, primitive operators for
evaluation control are also provided. When used
in conjunction with the queueing operators, these
operators allow a programmer to tailor the
scheduling of evaluation of concurrent accesses
according to many desired specifications.


FUNCTION GRAPH LANGUAGE 




The data-flow language described here is a
graphical variant of a pure applicative language,
and is called Function Graph Language (FGL)
[14]. An FGL program is a "graph
grammar" in which each production rule associates
a programmer-defined node (the antecedent of the
production) with a directed graph (the consequent
of the production). A well-formed graph is any
arbitrary interconnection of nodes and arcs,
including cycles, with the following properties:

1. The graph, and each node in it, has a
(possibly empty) set of arcs directed into
it, called input arcs, and exactly one arc
directed out of it, called the output arc.


2. Every arc in the graph (except for its input
and output arcs) is directed between two
nodes. An output arc may fan out into two
or more arcs, but merging of arcs into a
single arc is not allowed.


3. Every node has a name which may be either
pre-defined or programmer-defined, i.e., the
antecedent of a graph production.


The nodes in the graph represent operators, and
pre-defined names correspond to primitives of the
underlying machine. An operator is a pure
function whose output is determined solely by its
inputs, and does not have any side-effects. Arcs
represent data paths between operators, the
actual data values being either atomic, e.g.
integer, string, etc., or tuples of arbitrary
function graphs. Atomic data values are created
by 0-ary operators and represent constant
functions. Tuple data values are created by the
operator cons, and can be used to construct
conceptually infinite data structures, as will be
explained.


Figure 1 illustrates a typical graph production
in this language. Our convention is to use
ellipses for primitive operators, and rectangles
for programmer-defined operators. The primitive
operators car and cdr select the first and last
components of a tuple respectively. The
programmer-defined operator apply-to-all
constructs its output recursively by applying the
function f in its first argument to every
component of the infinite sequence x in its
second argument. The sequence x can be
constructed using nested pairs, e.g. cons(x1,
cons(x2, cons(...))), where x1, x2, ... are the
components of x. However, owing to demand-driven
evaluation, only those components of the output
that are demanded are in fact constructed.
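
The effect of demand-driven evaluation on an infinite sequence can be imitated in a conventional language with explicit thunks. The Python sketch below is an analogy rather than FGL itself: the class `Cons`, the helper `integers` and the use of zero-argument lambdas as unevaluated components are assumptions made for illustration, while `car`, `cdr` and `apply_to_all` mirror the operators named in the text.

```python
_UNSET = object()


class Cons:
    """A pair whose two components are evaluated only when selected."""

    def __init__(self, head, tail):
        self._thunks = [head, tail]        # zero-argument callables
        self._values = [_UNSET, _UNSET]

    def component(self, k):
        if self._values[k] is _UNSET:      # first demand: evaluate and remember
            self._values[k] = self._thunks[k]()
            self._thunks[k] = None
        return self._values[k]


def car(pair):
    return pair.component(0)


def cdr(pair):
    return pair.component(1)


def apply_to_all(f, x):
    """Apply f to every component of the (possibly infinite) sequence x."""
    return Cons(lambda: f(car(x)), lambda: apply_to_all(f, cdr(x)))


def integers(n):
    """The infinite sequence n, n+1, n+2, ... built from nested pairs."""
    return Cons(lambda: n, lambda: integers(n + 1))
```

Demanding only the first and third components of `apply_to_all(square, integers(1))` calls `square` for 1 and 3 and never for 2, which is the behavior the paragraph above describes.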


Figure 2 presents snapshots during the evaluation
of the operator apply-to-all when the first
component of the output is demanded by a car. We
indicate the presence of a demand using an
asterisk and, for the sake of brevity, we denote a
sequence created by cons using angle brackets.


Figure 1: FGL program for stream
processing (production for the operator apply-to-all)


Demand-driven evaluation 


The execution of any operator is initiated by a
demand on its output arc, and its completion
causes the computed result to be returned to the
source of demand. A demand on the output arc of
the graph initiates all evaluation. Demands
propagate along arcs in the graph when arguments
to operators are evaluated. Propagation of a
demand along some arc terminates when it reaches
a 0-ary operator, e.g. an integer, or the tuple
creating operator cons. A data value (a
reference to the tuple in the case of cons) is
then returned to the demanding node, which in
turn propagates its computed value back. The
evaluation of the graph is complete when the
result at the output node is finally computed.


An important feature of the evaluator is its
ability to exploit asynchronous and concurrent
evaluation of all independent operators. Two
operators are independent if the result computed
by one is not needed, either directly or
indirectly, as an input of the other. For
example, in figure 1, the operators car and cdr
are independent; however, apply and car are not.
Independent operators are the only source of all
concurrency in this model, which occurs due to
operators having multiple input arguments.
Asynchronous evaluation allows independent
operators to execute at their own speed without
any centralized timing constraint. Thus, the
propagation of demands, computation of values,
and propagation of values can all proceed
concurrently in different parts of the graph.


Types of operators 

Primitive operators are evaluated in one of two
ways. In the case of strict operators, e.g. add,
all arguments are demanded concurrently, and the
operator is applied only after all arguments have
completed their evaluation. Non-strict
operators, e.g. cond, do not require all their
arguments to be evaluated in order to compute
their result. Therefore, only some subset of
their arguments is evaluated, possibly in some
fixed order. The result of evaluation of a
primitive operator causes the operator to be
transformed into a 0-ary operator whose function
is the constant function corresponding to the
computed value.

Cons is a notable example of a non-strict
operator that does not evaluate any of its
arguments. Evaluation is done when selector
functions, e.g. car, cdr, etc., are applied to
extract specific components of the tuple. It is
this property of cons that allows conceptually
infinite sequences to be constructed, since the
components of cons could be function graphs that
recursively construct tuples, and are never
evaluated until actually needed [5].


When a programmer-defined operator is demanded,
the corresponding graph is substituted in place
of the operator. A demand is placed on the
output arc of the graph, which in turn initiates
further evaluation. As a consequence, only those
arguments of a programmer-defined operator that
are needed to compute its final result are
evaluated.

In the case of both primitive and
programmer-defined operators, one may assume that
after an operator has notified all sources of
demand of the availability of its result, it is
deleted from the graph (a). Thus, the computation
may be visualized as a dynamically expanding and
shrinking graph.


Evaluation of shared subgraphs 


One of the properties of an FGL graph is that it
may have shared subgraphs, since the output arc
of some operator in the graph may fan out into
two or more arcs. Shared subgraphs correspond to
common subexpressions, since the result of their
evaluation may be used as the input of more than
one operator. Sharing of some external graph
implicitly occurs when the input arc of a graph
fans out into two or more arcs. When two
independent operators share a common subgraph, it
is possible for both of them to demand the shared
subgraph concurrently. In general, it is
possible for the output node of a shared subgraph
to be demanded concurrently along all its
(fanned) output arcs.


(a) In practice, however, storage will be
deleted only in units of entire function graphs.

The synchronization needed here is a trivial case
of the synchronization needed when a shared
resource is accessed: concurrent or multiple
demands on a shared node are treated by
propagating only the first demand that arrives at
the node, thereby avoiding re-evaluation of the
common subexpression; the result of evaluation is
then returned to all sources of demand.
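
The rule that only the first demand on a shared node is propagated, with the result returned to every demander, corresponds to evaluate-once semantics under concurrency. The following Python sketch shows one way to obtain that behavior with a lock. The class name `SharedNode`, its `demand` method and the `evaluations` counter are hypothetical and serve only to illustrate the rule.

```python
import threading


class SharedNode:
    """A node whose subgraph is evaluated at most once, however many demand it."""

    def __init__(self, compute):
        self._compute = compute          # zero-argument callable for the subgraph
        self._lock = threading.Lock()
        self._done = False
        self._value = None
        self.evaluations = 0             # how many times the subgraph really ran

    def demand(self):
        # Later demanders wait on the lock while the first one evaluates,
        # then receive the stored result without re-evaluating.
        with self._lock:
            if not self._done:
                self._value = self._compute()
                self.evaluations += 1
                self._done = True
            return self._value
```

If several threads call `demand()` on the same node at once, each receives the same value and `evaluations` stays at 1.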


Further details of the features of the language
and the demand-driven evaluator may be obtained
from [14]. A loosely-coupled architecture, with
FGL as its machine language, is described
therein. Other features of the language and some
details of an implementation are also discussed.


OPERATORS FOR RESOURCE CONTROL 


A resource is accessed by applying an access 
operator to a pair of arguments: the first being 
the resource itself, and the second, a tuple 
consisting of all arguments needed to perform the 
actual operation. If the actual operation 
requires no arguments, then the resource will be 
the only argument to the access operator. Unlike 
operators discussed thus far, which are purely 
applicative, access operators could result in 
side effects. As a consequence, the behavior of
a resource could be history sensitive.


Figure 3a (b) illustrates a typical use of
resources. The programmer-defined operator
filesystem represents an abstract file system for
sequential files that is accessed by the set of
operators openfile, readnext, endoffile,
closefile and readerror, which have the usual
meanings. Side effects are caused by the
operators openfile, readnext and closefile. For
example, openfile returns nil if it was unable to
open the file; otherwise, it returns a reference
to the file and, as a side effect, actually opens
the file. Consequently, the operator filesystem
is history sensitive. For example, the result of
a readnext operation depends on whether an
openfile was first performed, and on the number
of readnext operations that preceded it.


Synchronization 

As indicated earlier, the synchronization of
concurrent accesses is achieved by queues. These
queues are created inside the resource, and an
access operator is synchronized by enqueueing a
request onto the appropriate queue. We
illustrate the use of the operator enq (defined
below) by defining the access operator openfile
of figure 3a:


(b) The primitive operator seq sequences the
evaluation of its arguments; the
programmer-defined operator compute is not
described here, but is assumed to perform some
computation.

Figure 3b: Synchronization of the
operator openfile (nodes shown:
select-queue-for-openfile, actual-openfile-operation)

When a resource is accessed for the first time,
its scheduler gets demanded and in turn proceeds
to examine its input queues. The scheduler is
essentially a non-terminating iterative program
that controls the order of removal and evaluation
of requests on its input queues. However, the
scheduler may have to wait occasionally, i.e.
when there are no requests to be served. Hence,
operators for waiting and dequeueing are also
provided. We first informally define the
primitive operators for queueing, and then
illustrate their use with examples.

Operators for queueing 

In the following definitions, q represents a
queue created using gqueue, and a is any
operator or FGL expression whose evaluation is to
be synchronized. In all our examples, a will be
the actual access operation to be performed on
the resource. A reference to a subgraph a is
obtained by cons(a) (c), and is similar to an
unevaluated expression (since cons does not
evaluate its arguments). The evaluation of a is
initiated by taking the car of such a reference.


gqueue()      creates an empty, updatable FIFO
              queue.

enq(q, a)     synchronizes the evaluation of a
              using q; the result of evaluating a
              is the value of enq.

deq(q)        the first request, a, is removed
              from q; a reference to a is the
              value of deq (a is not yet
              demanded).

waitq(q)      the demand on waitq is satisfied
              and returns T only when q becomes
              non-empty.

nonempty(q)   T if q is non-empty, nil otherwise.

(c) In FGL, cons can take an arbitrary number of
arguments, including just a single argument.

Mutual exclusion on all queueing operators
accessing a particular instance of a queue is
assumed. However, the evaluation of a takes
place outside this exclusion, and is under the
control of the dequeueing program. It should be
noted that accesses to different queues can occur
independently of one another. Furthermore, any
waiting that occurs does not block out other
queueing operators from accessing the queue.
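
The informal definitions of the queueing operators can be approximated in an ordinary threaded language. The following is a minimal Python sketch, not the FGL implementation: the class name GQueue, the use of zero-argument callables for the expression a, and the use of a condition variable and a Future to block without busy waiting are all assumptions made for illustration.

```python
import threading
from collections import deque
from concurrent.futures import Future


class GQueue:
    """FIFO queue of pending requests (a stand-in for an FGL gqueue)."""

    def __init__(self):
        self._requests = deque()
        self._changed = threading.Condition()  # guards _requests only

    def enq(self, a):
        """Enqueue thunk `a`; return its value once a scheduler evaluated it."""
        result = Future()
        with self._changed:
            self._requests.append((a, result))
            self._changed.notify_all()
        return result.result()  # sleeps until the value exists, no spinning

    def waitq(self):
        """Return True once the queue is non-empty."""
        with self._changed:
            self._changed.wait_for(lambda: len(self._requests) > 0)
        return True

    def nonempty(self):
        """Report the current status of the queue without waiting."""
        with self._changed:
            return len(self._requests) > 0

    def deq(self):
        """Remove the first request; return a reference that evaluates it."""
        with self._changed:
            self._changed.wait_for(lambda: len(self._requests) > 0)
            a, result = self._requests.popleft()

        def car():
            # Evaluation happens here, outside the queue's exclusion.
            # The reference is meant to be called exactly once.
            try:
                result.set_result(a())
            except Exception as exc:
                result.set_exception(exc)
            return result.result()

        return car
```

In this sketch a client thread calls enq and sleeps until some scheduler thread calls deq and then invokes the returned reference, which plays the role of taking the car.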


Avoidance of busy waiting

There are three occurrences of waiting in the
above operators:

1. an enq operator waiting for a to be dequeued
   and evaluated, before returning its result.

2. a deq operator waiting for a to be enqueued,
   before removing a.

3. a waitq operator waiting for a to be
   enqueued, before returning T.

In order to avoid busy waiting in the above
cases, we must first detect two conditions to be
true before initiating some action. For example,
the evaluation of a requires that a be both
enqueued as well as dequeued. In terms of
demands, this suggests the need to wait for two
demands before evaluating an operator. We
therefore introduce a special operator for this
purpose:

djoin(a)      evaluates a only after two demands
              are received.
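
The two-demand rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name DJoin and its methods demand and result are hypothetical, the expression a is passed as a zero-argument callable, and a threading.Event is used so that a consumer sleeps rather than polls.

```python
import threading


class DJoin:
    """Evaluate thunk `a` only after two demands have been received."""

    def __init__(self, a):
        self._a = a
        self._demands = 0
        self._lock = threading.Lock()
        self._ready = threading.Event()
        self._value = None

    def demand(self):
        """Register one demand; exactly the second demand triggers evaluation."""
        with self._lock:
            self._demands += 1
            fire = self._demands == 2
        if fire:
            self._value = self._a()
            self._ready.set()

    def result(self, timeout=None):
        """Sleep until the value exists, then return it."""
        if not self._ready.wait(timeout):
            raise TimeoutError("fewer than two demands so far")
        return self._value
```

One demand would come from the enqueueing side and the other from the dequeueing side; whichever arrives second starts the evaluation, so neither side needs to poll for the other.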


Figures 4, 5 and 6 present snapshots of the 
important transitions that occur when a queue is 
accessed by the queueing operators. As before, 
asterisks indicate the presence of a demand. It 
should be remembered that the operator cons does 
not evaluate its arguments until demanded by 
selector operators, which in this case are car 
and cdr. Also, the queue created by the operator 
gqueue is updatable, and hence is referenced. 


Scheduling

Scheduling involves two main tasks: waiting for
some subset of queues, based upon some condition,
to become non-empty, and selecting one such
non-empty queue for dequeueing, followed by
evaluation. The types of information that are
referred to in these conditions determine the
flexibility of scheduling that is achieved. A
partial list [2] of these types is the
following: the type of access operator, its
relative order of arrival, the actual arguments
needed for the operation, the state of
synchronization, the state of the resource, and
the history of accesses on the resource. The
operators introduced here, however, are mainly
for evaluation control. We first give informal
definitions of their behavior, followed by
examples of their use.


seq(a1, ..., an)
    evaluates a1, ..., an sequentially; returns
    the result of evaluating an.

par(a1, ..., an)
    evaluates a1, ..., an concurrently; returns
    the result of evaluating a1, as soon as
    it is ready.

spar(a1, ..., an)
    evaluates a1, ..., an concurrently; returns
    the result of evaluating a1, after all
    arguments have been evaluated.

arbit(a1, a2)
    evaluates a1 and a2 concurrently; returns
    nil if a2 evaluates its result before a1,
    else T; i.e. arbit favors a1 in case of a
    tie.


The operators seq and spar are strict since they
require all their arguments to be evaluated
before returning their result; however, the
operators par and arbit are not. The operator
arbit is indeterminate, since its result depends
on the relative speed of evaluation of its
arguments. Except for the indeterminacy
associated with arbit, these four operators are
similar to all other operators of the base
language in that they do not require any
extension of the demand evaluation semantics we
have described here.
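
A rough Python analogue of the four evaluation-control operators is given here, assuming that arguments are passed as zero-argument callables (thunks) so that nothing is evaluated before the operator demands it. Threads from concurrent.futures stand in for concurrent demand propagation, None stands in for nil and True for T; these choices are illustrative assumptions and not part of FGL.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def seq(*thunks):
    """Evaluate the arguments one after another; return the last result."""
    result = None
    for thunk in thunks:
        result = thunk()
    return result


def par(*thunks):
    """Evaluate concurrently; return the first argument's result when ready."""
    pool = ThreadPoolExecutor(max_workers=len(thunks))
    futures = [pool.submit(thunk) for thunk in thunks]
    try:
        return futures[0].result()
    finally:
        pool.shutdown(wait=False)  # do not wait for the other arguments


def spar(*thunks):
    """Evaluate concurrently; return the first argument's result after all end."""
    with ThreadPoolExecutor(max_workers=len(thunks)) as pool:
        futures = [pool.submit(thunk) for thunk in thunks]
        return futures[0].result()  # leaving the block waits for the rest


def arbit(a1, a2):
    """Return True if a1 finished first (or both finished), else None."""
    pool = ThreadPoolExecutor(max_workers=2)
    first, second = pool.submit(a1), pool.submit(a2)
    wait([first, second], return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return True if first.done() else None
```

Only arbit inspects which argument finished first, which is why it is the single indeterminate operator of the four in this sketch as well.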


An example: mutual exclusion of two accesses

We illustrate use of the queueing operators and
operators for evaluation control by constructing
a resource scheduler for a simple problem: mutual
exclusion of two types of concurrent accesses.

Figure 7 shows a scheduler mutex that enforces
mutual exclusion in the evaluation of requests on
its two input queues: p and q. The
programmer-defined operator mutex is essentially
a non-terminating iterative program, although
recursion is used to achieve this (d).



Figure 7: Mutual exclusion of two accesses

The operator seq first demands cond which in turn
causes the arbit to be demanded. Arbit then
demands the waitq operators on its two inputs.
As soon as an access is enqueued on to one of the
queues, the corresponding waitq will return T to
arbit. In case both queues are non-empty at the
same time, there will be a "race" between the two
waitqs in notifying arbit of their results.

Arbit will return T if its left input was
selected, and nil otherwise. Depending on
whether arbit returned T or nil, cond will demand
its second or third input argument respectively.
This causes the selected queue to be dequeued,
followed by evaluation of the access operator.
Upon completion, the result of evaluation will be
notified to cond, which in turn returns the
result to seq. This causes the second argument
of seq to be demanded, thereby starting another
iteration.

(d) This form of "tail" recursion can easily be
detected, and hence the storage for each
recursive invocation may be deleted as soon as it
has been completed.


Since each iteration evaluates only one access
operator, mutual exclusion among the access
operators is guaranteed. However, owing to the
possibility of a race condition, it is possible
for a particular queue to be ignored
indefinitely. Hence the above scheduler does not
guarantee fairness in serving its input queues.
Fairness can be guaranteed by a simple extension
to the above scheduler, i.e., by testing the
non-emptiness of each queue in strict
alternation, using the operator nonempty and
serving a request on a queue if it is non-empty.
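
The fair variant can be sketched sequentially in Python. This is an illustrative model rather than the FGL scheduler: the function name fair_mutex is hypothetical, the queues are plain deques of zero-argument callables, and the steps bound exists only so that the sketch terminates (the scheduler described in the text does not).

```python
from collections import deque
from itertools import cycle, islice


def fair_mutex(p, q, steps):
    """Visit p and q in strict alternation for `steps` turns."""
    results = []
    for queue in islice(cycle((p, q)), steps):
        if queue:                      # nonempty(queue): no waiting involved
            access = queue.popleft()   # deq: a reference, not yet evaluated
            results.append(access())   # car: one access evaluated at a time
    return results
```

Because each turn evaluates at most one access and the turns alternate, neither queue can be passed over twice in a row, whatever the arrival pattern on the other queue.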


Before concluding this section, we present a 
rough sketch of how the queues and the scheduler 
are encapsulated inside a resource, and how
access operators are synchronized by them. 
Figure 8 presents snapshots of some possible 
sequence of transitions during the operation of a 
resource that uses the scheduler mutex. 


EXAMPLES OF RESOURCE CONTROL IN TEXTUAL FGL 




The need for a textual representation is 
motivated by the fact that although graphs are 
useful during initial program development, and 
are suitable for representing concurrency, they 
could lead to awkward program structures. This 
is because every data dependency has to be 
explicitly indicated by an arc, thereby
complicating the physical layout of such graphs. 
On the other hand, if one were to name input arcs 
of a graph and any shared subgraphs within it, 
these data dependencies can be expressed simply 
by referring to these names. 


The correspondence between a graph and its
textual equivalent is quite straightforward,
hence we will not discuss the translation in
detail. In order to illustrate this
correspondence, we define the operator mutex in
the textual language (see figure 9).


The precise syntax of the language is defined in
[15], along with examples of its use.
However, we will explain the special features of
the language as and when we introduce them. In
this regard, the reader may note that the serial
composition of unary functions, e.g. f(g(10)),
may be written without parentheses, i.e., as f g
10. We use the keyword, function, when a
mathematical function is being defined;
otherwise, we use the keyword, procedure.


procedure mutex(p, q)
   begin seq(if arbit(waitq p, waitq q)
                then car deq p
                else car deq q,
             mutex(p, q))
   end


Figure 9: The operator mutex, in textual FGL


An important feature of the textual language is 
the ability to name any expression, using the let 
clause, and to refer to these names. When a 
graph has no shared subgraphs, e.g. the graph of 
figure 1, the textual equivalent will require no 
additional names, apart from those required for 
the input arcs of the graph. Even in such cases
where graphs do not have any shared subgraphs,
naming common subexpressions can be a useful
abbreviation, besides avoiding unnecessary 
computation. 


A simple version of the Readers and Writers

Figure 10 shows a scheduler for a simple version
of the Readers and Writers problem [3], in which
neither fairness nor any fixed priority is
enforced. The only control enforced is the
mutual exclusion of readers from writers, and the
exclusion of a writer from all other readers and
writers. Thus, readers are allowed to execute
concurrently with one another.


procedure readwrite1(wq, rq, rr)
   let removeread be deq rq,
       read be car removeread
   begin if arbit(waitq wq, waitq rq)
         then comment service writer;
              seq(rr,
                  car deq wq,
                  readwrite1(wq, rq, nil))
         else comment service reader;
              seq(removeread,
                  spar(read,
                       readwrite1(wq, rq, spar(read, rr))))
   end


Figure 10: The readers and writers problem:
a simple version


The scheduler readwrite1 has two queues rq and wq 
for read and write accesses respectively. The 
input argument rr maintains the set of running 
readers, as will be described, and is nil at the 
outermost call. 


When the queue for writers is selected by arbit, 
a write access is evaluated after ensuring all 
running readers have completed their evaluation. 
When the queue for readers is selected, the seq 
(line 10 of the program) first dequeues a read 
access without evaluating it, then causes spar 
(line 11 of the program) to evaluate the dequeued 
read access concurrently with the next iteration 
of readwrite1. Thus, as long as the queue for
readers is being selected on consecutive
iterations, all read accesses will be evaluated
concurrently.


The set of running readers, i.e. the set of
concurrently executing read accesses, is
maintained by the input argument rr and is
constructed recursively when consecutive read
accesses are evaluated. Since the input
arguments of a programmer-defined operator, in
this case the operator readwrite1, are not
evaluated until demanded, the input argument rr,
and hence the chain of spars, is evaluated only
when demanded explicitly (by seq in line 6 of the
program). This ensures that all running readers
have completed when a writer is about to start.


As indicated earlier, this version of the Readers 
and Writers problem guarantees neither fairness 
nor any fixed priority in evaluating read and 
write accesses. 
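
The effect of the rr argument can be made concrete with a small sequential model in Python. It does not run anything concurrently; it only records which accesses the scheduler of figure 10 would allow to overlap. The function name, the ("r", name) and ("w", name) encoding of requests, and the notion of a "phase" are assumptions introduced for this sketch, and the input order stands in for the successive choices made by arbit.

```python
def readwrite1(selections):
    """Group accesses into phases permitted by the simple scheduler.

    `selections` lists requests in the order in which they were picked,
    as ("r", name) for a reader or ("w", name) for a writer.
    """
    phases = []
    rr = []  # running readers, playing the role of the rr argument
    for kind, name in selections:
        if kind == "w":
            if rr:  # seq(rr, ...): every running reader completes first
                phases.append(("readers", tuple(rr)))
                rr = []
            phases.append(("writer", name))  # a writer runs alone
        else:
            rr.append(name)  # spar(read, rr): joins the concurrent readers
    if rr:
        phases.append(("readers", tuple(rr)))
    return phases
```

For the selection order r1, r2, w1, r3 the model yields a phase with r1 and r2 together, then w1 alone, then r3, which matches the description: consecutive readers overlap, and a writer starts only after the chain of running readers has completed.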


The Readers and Writers problem with 'writer' priority

Suppose that in addition to the exclusion
constraints of the simple version of this
problem, it is required to give a waiting writer
priority over waiting readers. Figure 11 shows
the resource scheduler that achieves the desired
scheduling requirements.


procedure readwrite2(wq, rq, rr)
   let write be seq(rr,
                    car deq wq,
                    readwrite2(wq, rq, nil)),
       removeread be deq rq,
       read be car removeread
   begin if nonempty wq
         then comment service writer;
              write
         else if arbit(waitq wq, waitq rq)
              then comment service writer;
                   write
              else comment service reader;
                   seq(removeread,
                       spar(read,
                            readwrite2(wq, rq, spar(read, rr))))
   end


Figure 11: The readers and writers problem: 
writers priority 


The basic idea is to examine the queue for
writers at the start of each new iteration. If
there is a waiting writer it will be allowed to
execute after ensuring all running readers that
have been previously scheduled have completed.
Thus, as long as the queue for writers is
non-empty, writers will have priority over
readers. When the queue for writers becomes
empty, waiting readers are allowed to execute
concurrently with one another, and will continue
to do so until the queue for writers becomes
non-empty (e).

(e) It should be remembered that the operator
nonempty, unlike waitq, does not result in any
waiting, but merely returns the current status of
the queue.



In the transient situation when both queues are
empty, and become non-empty simultaneously, a
reader may be selected in preference over a
writer. If this situation occurs infinitely
often, to consider the worst case, then readers
and writers will be served in alternation.
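
The priority rule alone (test the writers' queue first on every iteration) can be modelled in a few lines of Python. This sketch deliberately ignores reader concurrency and the arbit race discussed above; the function name, the arrivals encoding and the iteration-by-iteration arrival model are assumptions made for illustration.

```python
from collections import deque


def readwrite2(arrivals):
    """Return the service order under writer priority.

    arrivals[i] lists the ("r", name) or ("w", name) requests that
    arrive just before iteration i of the scheduler.
    """
    wq, rq = deque(), deque()
    served = []
    i = 0
    while i < len(arrivals) or wq or rq:
        if i < len(arrivals):
            for kind, name in arrivals[i]:
                (wq if kind == "w" else rq).append(name)
        i += 1
        if wq:                 # nonempty wq: a waiting writer goes first
            served.append(wq.popleft())
        elif rq:               # otherwise a waiting reader is served
            served.append(rq.popleft())
    return served
```

With readers r1 and r2 and writer w1 arriving together, followed by writer w2 one iteration later, the service order is w1, w2, r1, r2: readers proceed only once the writers' queue has drained.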


We finally present a skeletal description of a 
resource database that uses the above scheduler, 
in order to illustrate the overall structure of 
the typical resource in textual FGL (see figure 
12). The dots in the program indicate the 
absence of details. The where clause allows the
nesting of functions and procedures, and is
similar to the block-structure of conventional
languages. However, the scope of a name does not
extend by default over all nested functions and
procedures, but must be explicitly imported.


resource database()
   let actualdatabase be ...
   access procedure write ...
          procedure read
   scheduler dbmanager()
      queues write: wq,
             read: rq
      begin readwrite2(wq, rq, nil)
            where procedure readwrite2(wq, rq, rr)
                  ...
      end
end


Figure 12: Skeletal structure of a resource 


RELATED WORK 


The main thrust of previous work in data-flow
models (both data- and demand-driven) has been on
determinate and so-called "value-oriented"
computations. Efforts at handling indeterminate
behavior and the problems of resource control
have been few.


We summarize some aspects of related work: 

1. Dataflow Monitors [1] are closely related
   to our approach. Their scheduling and
   arbitration of requests is also explicit and
   user-programmable, but the underlying
   computational model is data-driven. Since
   the transfer of data between operators is
   the only means of initiating any computation
   in a data-driven model, two types of
   operators, entry and exit, are used (for
   indeterminately merging all input requests
   to a resource and for returning the results
   back to the requesting source) (f). For the
   same reason, data signals, such as
   readenable and readdone for a read
   operation, are needed within the scheduler
   for signalling the start and termination of
   an operation.

(f) In our approach, a single operator gqueue
performs both the entry and exit functions. The
external demand to access the resource
corresponds to the entry, and the return of the
result by the resource corresponds to the exit.



2. The work of Friedman and Wise on Applicative
   Multiprogramming is also related [6]. An
   indeterminate constructor frons is used for
   constructing a multiset, the order of whose
   elements is determined only upon access.
   However, the synchronization of concurrent
   accesses and resource scheduling are not
   handled at the level where frons is used.

3. Serializers [8] have some similarities with
   our approach. A Serializer is a high-level
   synchronization construct that has been
   developed in an Actor message-passing model
   of computation. Serializers do not require
   the aforementioned "bracketing" operations,
   and hence provide good modularity. However,
   there is a fixed underlying arbitration and
   scheduling discipline, which is perhaps less
   flexible than desirable.

4. Sentinels [13] come closest to our approach
   to synchronization and resource scheduling,
   although the concept has been developed in
   an Algol-like language, extended with some
   tasking facilities. A Sentinel is a
   sequential process that controls the order
   of evaluation of requests on its input
   queues. The arbitration and scheduling of
   requests in a Sentinel is also explicit and
   user-programmable. In comparison, our
   scheduler may be thought of as a Sentinel
   that controls the order of evaluation of FGL
   expressions.


CONCLUSIONS AND FUTURE WORK 


We have presented an approach to resource control
that has been influenced strongly by a
demand-driven model of resource usage and an
applicative style of programming. We have shown
that some problems of resource control, such as
the avoidance of busy waiting and scheduling, are
solved in a more elegant manner under demand
evaluation than in conventional models of
evaluation. Furthermore, the clarity of our
examples indicates that a demand-driven model of
execution is well-suited to conceptualizing
resources and their control.


The operators introduced here for synchronization 
and scheduling are representative of a class of 
machine primitives for resource control in 
applicative languages, and are not meant to be 
exhaustive. Although many standard problems in 
synchronization can be solved quite elegantly 
using our primitives, in general the adequacy of 
our primitives from the standpoint of efficiency 
and convenience of use might be subject to 
question. 


Heretofore, the applicative style of programming
has not been explored as a vehicle for resource
control in depth. Although we have introduced
indeterminate and side-effect operators for
arbitration and queueing, the applicative style
actually aids well-structured use of these
operators, since the order of evaluation of
arguments to a function is the only means of
achieving any form of "control."


In order to enhance reliability and
well-structured use of these primitives, we are
also developing an expression-based language, in
the sense of Path Expressions, for
specifying the behavior of our scheduler [10].
We envisage that resource scheduling in FGL will
eventually be programmed using such expressions,
and a compiler will automatically translate them
into the primitives described in this paper. For
example, the expression

     (p + q)*

specifies that the scheduler may serve any
arbitrary sequence of p's and q's. The
translation of this expression is the scheduler,
mutex, described earlier.
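
What such an expression admits can be checked mechanically. The following Python sketch treats an expression like (p + q)* as a regular expression over single-letter operation names and tests whether a trace of served requests conforms to it. The helper names and the restriction to single-letter names are assumptions for illustration; this is not the expression language of [10].

```python
import re


def to_regex(path_expression):
    """Translate the small notation used here: '+' is choice, '*' repetition."""
    return re.compile(path_expression.replace(" ", "").replace("+", "|"))


def allowed(path_expression, trace):
    """True if the whole trace of served operations matches the expression."""
    return to_regex(path_expression).fullmatch(trace) is not None
```

Under this reading, any interleaving of p and q, including the empty trace, is accepted by (p + q)*, while a trace that mentions some other operation is rejected.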


The main advantages of such an expression-based
language for specifying resource control in FGL
are a) the specifications are concise, elegant
and the notation makes good stylistic sense, b)
the semantics of such expressions can be
formalized in terms of FGL graphs, and c) the
structure of the translated programs closely
preserves the structure of the defining
expressions, hence the correctness of the
translation may be demonstrated more easily. The
main disadvantages are a) it is possible to write
ambiguous specifications, and b) additional
notation seems necessary for specifying fairness
criteria and exclusion/priority constraints based
on parameters to operations. Thus, their
expressive power is short of being complete.

Figure 2: Snapshot evaluation of apply-to-all

Figure 3a: FGL program for sequential file processing

Figure 5: Dequeueing a non-empty queue

Figure 6: Dequeueing a non-empty queue, followed by evaluation

Figure 8: Overview of a resource operation


Summary


A network computer is an MIMD computer built 
from an interconnected collection of independent, 
asynchronously executing, and loosely coupled 
processing nodes. Each node consists of at least 
one CPU attached to a local RAM. The memory of 
one node is not directly accessible by any other 
node. An example of such a machine is described in 


To control such networks, we have proposed
[1] a high-level operating system schema
structured as in Fig. 1 as a framework for
distributed control techniques. The
hierarchical structure can be formed in networks
with entirely different physical connection
topologies. Each blackened circle represents a
node. The links indicate management paths; they
do not necessarily represent direct physical
connections between nodes. The nodes at level-0
(workers) are available for user tasks. Those at
higher levels (managers) are responsible for
maintaining the integrity of the local
communications subnetwork and for performing
resource allocation in ever larger subregions.
Although the exact number will vary, each manager
node can probably directly handle about 10 to 20
subnodes. To avoid bottlenecks, higher
level managers use more condensed summaries of
allocation information than do lower level
managers.

Fig. 1

Since the control schema outlined above is
meant to be implemented in arbitrary, and
possibly even dynamically changing, network
computer topologies, an automatic procedure for
creating hierarchies with close to minimal delays
between linked nodes is desirable. The "FOCUS of
activity" initialization technique introduced in
this paper partitions a network computer into
layers of processor clusters consisting of
managers and their subnodes. To form a management
hierarchy, all meaningful activity is initiated
by and surrounds a single node, the FOCUS. To
approximate heavy message load conditions, all
processors send meaningless local messages when
they are not participating in FOCUS-initiated
activity. The FOCUS is not fixed in one location
but rather progresses through the network
trailing a chain of inter-foci pointers that are
used in later phases of the initialization.

The control hierarchy is built up from the
leaves. To produce a hierarchy of L levels, there
are L-1 separate but almost identical phases to
the initialization. Each phase selects the next
higher level of managers until at last a top
level (oligarchy) is formed. The technique
assumes that the lowest level communications
kernel that passes messages between physically
connected (neighboring) nodes already exists in
each node before hierarchy initialization begins.
One arbitrarily selected node is known as the
SOURCE. During phase 1 it is both the first and
last FOCUS. It is also the last node to be active
during each phase.

The objective of the procedure is to assign
N subnodes per manager at each level of a tree
with the constraint that each path from a level-j
manager to its subnodes be shorter than Rsubj
physical links. Rsubj is computed from Rsub1 and
N. N is supplied as a parameter; Rsub1 may either
be supplied or estimated from N.

The following labeled steps and
cross-referenced example describe the FOCUS of
activity technique in detail:


A. The SOURCE broadcasts a message telling all
nodes to obtain the identifiers of their
physically connected neighbors and to start
sending dummy messages to provide communications
background activity. The SOURCE becomes the first
FOCUS for phase j=1.

B. The FOCUS sends limited (by Rsubj) broadcast
"connect" messages to its neighbors in
level-(j-1).

C. Other nodes less than Rsubj links away send
"reply" messages back to the FOCUS if they are
not yet managers in phase j nor subnodes in any
other phase.

D. The FOCUS stores paths to the first N or fewer
nodes which reply before a timeout interval
expires.

E. If zero nodes reply, the previous FOCUS must
select a new FOCUS at step K.

F. Otherwise, the FOCUS accepts each of the N or
fewer replying nodes into a new cluster and sends
each a list of the identifiers of all the others.

G. Each accepted subnode sends the FOCUS a list
of its "connections" outside the cluster. In
phase 1 all the physical connections are used; in
phase j>1 only the forward and backward pointers
to foci in phase j-1 are used.

H. The FOCUS and the worker "nearest" each
subnode (in a message-delay sense) in the new
cluster send one message to every other node in
the cluster. Each worker sends the sum of the
reply delay times to the FOCUS.

I. If the FOCUS has the least delay sum it
becomes manager for the cluster. Otherwise the
worker with the least sum becomes both FOCUS and
manager and receives all the information about
the cluster from the previous FOCUS. In phase 1 a
deposed FOCUS becomes a subnode of the new FOCUS;
in later phases it again becomes a subnode of its
previous level-1 manager.

J. The FOCUS lists as possible next temporary
foci all the nodes physically connected
externally to the cluster with the less
frequently connected (i.e. farthest) ones first.

K. To spread groups far apart, the FOCUS finds
the first (farthest) potential temporary FOCUS
which is not yet a subnode nor a manager in this
phase (j) by polling all the connected nodes. In
phase 1 the temporary FOCUS is a worker. In other
phases the temporary FOCUS selects a worker from
its subtree. The old FOCUS stores the identifier
of the worker. The worker becomes the next FOCUS
at step B. If no temporary FOCUS is left, then
the FOCUS for the previously formed cluster in
this phase must select the next FOCUS at step J.
If there is no previous FOCUS in this phase then
phase j is complete and step L is performed.

L. A FOCUS for a new phase j+1 must be selected
by the SOURCE if more than 2N managers were
chosen in phase j. Starting with the SOURCE, the
subnodes of old foci are searched until a worker
is found. The worker becomes the first FOCUS of
the new phase j+1 at step B.

M. Otherwise the 2N or fewer managers in the
phase j FOCUS chain exchange identifiers and use
limited broadcasts to find short paths to each
other. They form the oligarchy of the hierarchy.

N. The SOURCE broadcasts a message to all nodes
telling them to stop sending dummy messages.
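
The formation of a single phase-1 cluster (steps B through D, F, H and I) can be illustrated with a short Python sketch on a square 4-neighbor mesh. Several simplifications are assumed and are not taken from the paper: hop counts stand in for measured message delays, replies are taken nearest first with ties broken by coordinates instead of by a timeout, the cluster consists of the FOCUS plus up to N repliers, and all function and parameter names are hypothetical.

```python
from collections import deque


def mesh_neighbors(node, size):
    """Physically connected neighbors in a size-by-size 4-neighbor mesh."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny)


def hop_counts(start, size, limit=None):
    """Breadth-first hop counts from `start`, cut off at `limit` links."""
    hops = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if limit is not None and hops[node] >= limit:
            continue
        for neighbor in mesh_neighbors(node, size):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                frontier.append(neighbor)
    return hops


def form_cluster(focus, size, n, radius, claimed):
    """Form one cluster around `focus`; return (manager, members)."""
    # B, C: nodes fewer than `radius` links away reply unless already claimed.
    reach = hop_counts(focus, size, limit=radius - 1)
    replies = sorted((hops, node) for node, hops in reach.items()
                     if node != focus and node not in claimed)
    # D, F: keep the first n or fewer repliers.
    members = [focus] + [node for _, node in replies[:n]]

    # H, I: the member with the least delay sum becomes manager;
    # the FOCUS keeps the role when it ties for the least sum.
    def delay_sum(member):
        hops = hop_counts(member, size)
        return sum(hops[other] for other in members)

    manager = min(members, key=lambda m: (delay_sum(m), m != focus))
    return manager, members
```

Starting from a corner FOCUS in a 3-by-3 mesh with a radius large enough to reach every node, the sketch moves the manager role to the center node, which has the smallest total hop count to the rest of the cluster. When every nearby node is already claimed, no node replies and the cluster contains only the FOCUS, which corresponds to the situation handled in step E.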

Fig. 2

Figure 2 shows a 4-neighbor mesh network
with the SOURCE positioned at point S. For
readability the communications links which attach
each node (0) to its 4 nearest neighbors are not
drawn. Hierarchy formation begins phase 1 with
the SOURCE as the first FOCUS. In phase 1 the
FOCUS tries to form clusters of N=9 level-0 nodes
close to itself by sending limited extent
(Rsub1=3) broadcast connect messages (steps A&B).
Level-0 nodes which receive the broadcast and are
not yet subnodes of any manager respond to the
FOCUS (C). In general, during phase j,
level-(j-1) nodes which are not subnodes of any
manager respond. The cluster of nodes
surrounding the SOURCE was formed first by the
procedure (D).

Once a cluster is formed, the nodes
in that cluster determine among themselves
which one can communicate best with all
the others (E-I). In phase 1 all of the
level-0 nodes exchange messages and pass
the total delay times to the FOCUS. In
phases j>1 the level-(j-1) managers in the
new cluster all tell the "nearest" level-0
node to do this. In this way only worker
nodes can become level-j managers and the
already formed lower levels of the
hierarchy are undisturbed. Thus, a
well-situated worker node becomes the
FOCUS and manager of a new cluster.

The node which was the FOCUS around
which the cluster was formed is only a
temporary FOCUS because all record that it
was ever the FOCUS is discarded. A chain
of forward and backward pointers connects
all of the "true", i.e. non-temporary,
foci in the order in which they became
managers. Since the chain is acyclic there
must always be at least one pointer to a
level-(j-1) FOCUS outside of a newly
formed cluster. The new manager forms a
list of all such external connections. In
phase 1 the list consists merely of direct
physical connections outside the cluster.

To find the next FOCUS in phase j,
the current FOCUS scans the list of
external connections (J). If one of those
nodes is not yet a member of any level-j
cluster, then it is given the task of
choosing a nearby worker to be the next
FOCUS of phase j. In phase 1 a
level-(j-1) node is already a worker. When
the FOCUS can find no more candidate foci
the previous FOCUS in the chain assumes
... phase finishes when ...
cannot find an uninitialized node to be
the next FOCUS. Since the foci form a
tree, the FOCUS will be back at the SOURCE
then and the next phase, if needed, can
start (L-N).

Fig. 3


Figure 3 shows an actual hierarchy formed by
the technique during simulations with a network
of 27**2 nodes. Each node is marked by a three
digit cluster number to which it belongs. Level-1
managers are circled. The number assigned to a
manager represents its position in the FOCUS
chain. All worker nodes are assigned the same
number as their manager. Level-1 clusters are
enclosed by straight boundaries; level-2 clusters
by curved boundaries. A 729 node mesh is small
enough that the edges have an undue influence on
level-2 clusters for many combinations of input
parameters. Still, the clusters produced by the
technique are generally quite compact.


The FOCUS of activity hierarchy formation
technique has both good and bad characteristics.
One strong point is that the technique does not
need to be told anything about the global
topology of the network to produce "good"
hierarchies. In experiments with mesh networks it
has been able to form hierarchies
with average link utilizations and path densities
only slightly higher than minimal.

Another nice feature is that clusters of
nodes can be made as "tight" as desired. Since
the nodes in a cluster will tend to communicate
often with each other during the solution of
multiple task-multiple node problems, it is
important that message delays between them be
short. Based on the experiments it appears that
compact clusters with about N members can be
achieved consistently.

Finally, the FOCUS technique does not
generate oscillations in hierarchy structure nor
does it depend on race conditions. Although the
technique will not produce the same structure
repeatedly in a given network, the hierarchies
will all have almost identical characteristics.
Since oscillations are guaranteed not to occur
the technique produces a control structure in a
short time even for a large network. The FOCUS
technique can be regarded as an algorithm for
producing hierarchies in network computers.

The technique also has some undesirable
features. First, it produces a hierarchy more
slowly than parallel techniques we have
discovered. Second, all clusters do not
necessarily wind up with the same number of
nodes. Extra clusters and even levels of
hierarchy may be formed. Lastly, there is no way
to predict nor control exactly which nodes will
cluster together. In certain situations it might
be desirable to specify some cluster connections
in advance.
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DESIGN OPTIMIZATION FOR A SPECIAL-PURPOSE MULTIPLE-COMPUTER* 
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SUMMARY 


The design and performance analysis of the 
architecture of a special-purpose multiprocessor 
is presented. The architecture is a hierarchically
structured and functionally distributed
type. Its operating system is a multilevel 
structure implemented in an optimal combination 
of hardware, firmware, and software. This archi- 
tecture is suited to any application, such as 
process control or real-time system simulation, 
in which the basic computational tasks are dedi- 
cated and do not change in time. 


Each processor has a dedicated memory space 
in which program tasks are stored. In addition, 
there is a system bus to a global memory which is 
used primarily for communication among the
processors. To minimize contention for this system
bus, selected areas of global memory are dupli- 
cated at each processor. This allows the proces- 
sor to obtain needed information by using a local 
bus rather than the global, system bus. All 
write operations to the shared memory are global 
and the information is duplicated at processors 
having shared memory at that address. Read
operations then become primarily local and can
occur in parallel. 


Control functions are distributed among the 
processors; the scheduling and execution of con- 
trol and application tasks are governed at each 
processor level by a local, real-time executive. 
This executive is implemented primarily in firm- 
ware to minimize overhead. However, the control 
structure is designed to be independent of imple- 
mentation so that a variety of processors can be 
utilized together. Moreover, it is possible to 
add to each processor an additional subprocessor 
which implements the executive in hardware. 


A block diagram of the system is shown in 
Figure 1. Each processor has its own local 
memory and I/O interfaces as required. In addition,
each processor has access to a global shared
memory. Access to the shared-memory bus is con- 
trolled by a bus arbitration module which imple- 
ments a multiple-priority, daisy-chained struc- 
ture. Arbitration is overlapped to provide maxi- 
mum bus utilization. The control processor 
occupies the position nearest the arbitration 
module, giving it the highest priority at each
level. Each processor has a control port which
is accessed by the control bus. No arbitration 
is required for this bus as only the control 
processor may act as the bus master. 


The key to successful operation of a multi- 
ple-instruction-stream, multiple-data-stream 
(MIMD) computer is effective communications among 
the processors. As discussed previously and 
shown in Figure 1, there are two system buses-- 
one for communicating data and the other for 
communicating control information--which are 
common to all of the processors. The most criti- 
cal system resources are these global buses which, 
by being shared by all of the processors, become 
the limiting factor in the overall performance of 
this multiple-computer system. It is thus crucial 
that the design and utilization of these buses be 
optimized. 


The architecture of the entire system can be 
designed to minimize bus usage. Most of the sys- 
tem control functions are distributed among the 
processors and are handled by the local executive. 
Also, because the programs to be executed are
fixed, each processor is assigned its function in 
advance. Hence, although one processor is desig- 
nated as a control processor, it needs to communi- 
cate only a minimum of control information during 
normal system operation. This control informa- 
tion is transmitted on the control bus so as not 
to interrupt the data flow on the other bus.


One way for processors to communicate is by 
writing messages and results into a shared memory 
where other processors can access this informa- 
tion. For the MIMD system described herein, all 
of the system memory is distributed among the 
processors. Part of the memory for each processor
is local and can be accessed only by that
processor. This allows most run-time memory oper- 
ations to be local, thereby avoiding contention 
for the global buses. The rest of a processor's 
memory is global and available to all processors 
for memory-write operations. This global portion 
is designed in a dual-port configuration so that 
it can be read locally while being written glob- 
ally. Also, all processors can read in parallel 
without any possibilities for contention or dead- 
lock. By removing all global read operations 
from the bus, the bus traffic is reduced by much 
more than half. 


As an example of this reduction, if a param- 
eter calculated by one processor is needed by 
four other processors, a simple shared memory 
would handle this transfer in five cycles (one to 
write and four to read). With the shared memory 
duplicated at each processor, only one cycle is 
required to simultaneously write the parameter to 
all processors which need it. The destinations 
for a parameter are determined by its location in 
the memory address space. The read operations 
then occur locally and independently. 
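
The cycle counts in the example above can be restated as a small calculation. This is only an illustrative sketch: the function names are hypothetical, and it assumes, as the text does, that each global write or global read occupies one bus cycle and that local reads occupy none.

```python
def simple_shared_cycles(readers):
    """Bus cycles with a single shared memory:
    one global write plus one global read per consuming processor."""
    return 1 + readers


def duplicated_shared_cycles(readers):
    """Bus cycles when the shared memory is duplicated at each processor:
    one global write reaches every copy; the reads are local."""
    return 1


readers = 4  # four processors need the parameter, as in the text
print(simple_shared_cycles(readers))      # 5
print(duplicated_shared_cycles(readers))  # 1
```

Under these assumptions the bus traffic for one parameter drops from 1 + r cycles to 1 cycle when r processors consume it.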


An additional architectural feature which 
maximizes the bandwidth of the global data bus is 
synchronous operation. This reduces the overhead
associated with each data transfer and allows most
data transfers to be scheduled. 


The utilization of the bus can be further
minimized because the system is to be used for a
single dedicated application. The program for
this application will be partitioned into tasks
and assigned to processors for execution in a way
that minimizes the interprocessor communications.
Also, the communications can be scheduled in
advance to minimize idle periods for the bus and
wait periods for processors, both of which add to
communications overhead. Neither of these
optimizations is readily available in a
general-purpose MIMD system.


For the multiple-computer system presented in
this paper, a cycle is the time allowed to
complete a write plus a read on the global
shared-memory bus. During each cycle, a set of
calculations is also performed by the individual
processors. The physical sampling period, which
consists of several cycles, is a function of the
highest significant natural frequency of the system
being simulated. The sampling period is
established by the control processor for all
applications processors. Because the total computation
is performed by a repetitive sequence of cycles,
the speed-up ratio, which is a system efficiency
measure, is based on only one cycle.


Consider a multiple-computer system which has
$n$ individual processors and a total computation
load of $M$ tasks, where a task is a self-contained
portion of this load. The average computation
time for one task is denoted by $T_j$. The average
time for data exchange on the shared-memory bus
per task with only global shared memory is
denoted by $T_c$. The average time for data exchange
on the shared-memory bus per task with both local
and global shared memory, $T_c'$, is given by

$$T_c' = k\,T_c$$

where $k$ is the local shared memory factor ($0 < k < 1$).
A lower bound for $k$ is $1/(n-1)$.


The average processor utilization for
computation, $\alpha$ ($0 < \alpha < 1$), is given by

[equation illegible in source]

where $T_y$ = the maximum time allowed for
computation. Given the above parameters $T_j$, $T_c'$, $n$, $M$,
and $\alpha$, the speed-up ratio for the multiple-computer
system without distributed control is given by

[equation illegible in source]

where $T_p$ = duration of control phase. The
speed-up ratio for the multiple-computer system with
distributed control and with local shared memory
is thus given by

[equation illegible in source]

NUMERICAL COMPUTATIONS ON CM*


Peter G. Hibbard and Neil S. Ostlund
Department of Computer Science 
Carnegie-Mellon University 
Pittsburgh, PA 15213 


Summary 


Experience related to the suitability of multiprocessors
for large-scale computations has been mainly limited to
synchronous SIMD machines such as Illiac IV, and array
and pipeline processors. Relatively little work has been
performed on asynchronous MIMD machines such as
C.mmp [1], Cm* [2], and S-1 [3]. Algorithmic
decompositions which are suitable for such organizations
have been studied by Kung [4] and Baudet [5], but these
investigations were not concerned with software
organizational problems of task force management, or with
the effects of different strategies for memory and processor
allocation, or with the effects of different synchronization
techniques on the performance of programs using these
algorithms.


For the past year we have been involved with assessing
the suitability of Cm*-like architectures for large-scale
scientific computations, specifically in the area of the
approximate solution of Schrödinger's equation for
molecular systems, and in the area of the statistical
mechanics of liquids. Our current efforts are directed
towards the study of a Monte Carlo simulation of the
properties of liquid water. This problem has a non-trivial
computational complexity and provides an excellent
vehicle for studying memory organization, communication,
synchronization, and other factors which affect the
efficiency of use of a multiprocessor. In addition, such
simulations of liquids have been of great interest in recent
years and the multiprocessor calculations can be
compared directly with extensive related calculations on
conventional processors.

The Metropolis Monte Carlo algorithm [6] obtains the
properties of a macroscopic liquid by averaging over a
large number of random microscopic configurations of a
collection of molecules. A microscopic configuration is
represented by the positions and individual
orientations of a finite number (N) of molecules contained
in a "central box". The infinite liquid is simulated by having
"mirror image boxes" surround the central box. A new
configuration is generated from the current configuration
by choosing the next molecule in sequence and moving it
to a random new position and orientation. The total
potential energy $E_j$ of the new j-th configuration is then
computed, using an empirical potential energy function, by
summing over the changes in the pair interaction energies
of the moved molecule with the N-1 unmoved molecules.
The new configuration is accepted or rejected according to
a simple decision criterion [6] dependent on the change in
potential energy $\Delta E_{ij}$ in going from the i-th to the j-th
configuration. If $\Delta E_{ij}$ is negative the new configuration is
accepted; if it is positive then the acceptance probability is
$\exp\{-\Delta E_{ij}/kT\}$, where $k$ is Boltzmann's constant and $T$ is
the absolute temperature. If the new configuration is
rejected, another configuration is generated from the
current one, and the steps repeated; the current
configuration is included as a member of the sampled set
as many times as is necessary until a new configuration is
accepted. In this way a sequence of configurations is
generated which samples the appropriate classical
Boltzmann distribution, i.e.,

configuration probability $\propto \exp\{-E_i/kT\}$


Macroscopic properties are obtained by a simple averaging
over a sequence of $O(10^8)$ configurations.
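
The move-and-accept step described above can be sketched as follows. This is a minimal illustration, not the authors' program: the Lennard-Jones-style `pair_energy` is a stand-in for the empirical water potential (an assumption), orientations and periodic "mirror image boxes" are omitted, and all function names are hypothetical.

```python
import math
import random


def pair_energy(r):
    """Hypothetical pair potential (Lennard-Jones form, reduced units)."""
    inv6 = 1.0 / r ** 6
    return 4.0 * (inv6 * inv6 - inv6)


def move_energy_change(coords, i, new_pos):
    """Sum the changes in the N-1 pair energies that involve molecule i."""
    delta = 0.0
    for j, other in enumerate(coords):
        if j == i:
            continue
        delta += pair_energy(math.dist(new_pos, other))
        delta -= pair_energy(math.dist(coords[i], other))
    return delta


def accept(delta_e, kT, rng):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def sweep(coords, kT, rng, max_shift=0.1):
    """Try to move each molecule once, in sequence; return the number accepted."""
    accepted = 0
    for i in range(len(coords)):
        new_pos = tuple(c + rng.uniform(-max_shift, max_shift) for c in coords[i])
        if accept(move_energy_change(coords, i, new_pos), kT, rng):
            coords[i] = new_pos
            accepted += 1
    return accepted


rng = random.Random(0)
coords = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
print(sweep(coords, kT=1.0, rng=rng), "of", len(coords), "moves accepted")
```

A rejected move leaves `coords` unchanged, so the current configuration is counted again when averages are accumulated, as the text requires.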


The bottleneck in the serial Metropolis algorithm is the
calculation of intermolecular interactions. For a single
move, the time complexity is O(N) because there are N-1
new interactions with the moved molecule. The
bookkeeping operations of generating the move, accepting
or rejecting the move, etc., are constant-time operations.
Our initial decomposition scheme, which is almost certainly
not the optimum one, uses K processors to evaluate the
N-1 interactions with a moved molecule. We employ a
master-slave relationship among the processors with the
master processor performing the bookkeeping operations
and the slave processors evaluating the intermolecular
interactions. The algorithm is a synchronized lock-step
algorithm; all slaves complete their current activity before
new activities are assigned by the master. An
asynchronous algorithm would be preferred provided it
could be shown to converge to the same Boltzmann
average as the present synchronous one. This first attempt
at a parallel Monte Carlo algorithm is potentially capable of
a speedup which is linear in K, the number of processors
available. However, to obtain linear speedup requires that
memory contention and interprocess bus contention are
small, and that synchronization and latency costs are
negligible. Initial experiments show that this is far from true
and synchronization appears to be particularly costly.
Developing an asynchronous algorithm, which we are
currently trying to do, is more an exercise in statistical
mechanics than in algorithm development.
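
The lock-step decomposition described above can be sketched as follows. This is an illustration under stated assumptions, not the Bliss-11 program: Python threads stand in for the K slave processors, `pair_delta` is a hypothetical callback returning the energy change for one pair, and the pool's `map` call plays the role of the synchronization point at which the master waits for all slaves.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(indices, k):
    """Split indices into k contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(indices), k)
    chunks, start = [], 0
    for w in range(k):
        size = q + (1 if w < r else 0)
        chunks.append(indices[start:start + size])
        start += size
    return chunks


def slave_sum(pair_delta, chunk):
    """Work done by one slave: sum the pair-energy changes for its chunk."""
    return sum(pair_delta(j) for j in chunk)


def master_energy_change(pair_delta, n, moved, k):
    """Master: hand the N-1 interactions to k slaves, wait for all, add the results."""
    others = [j for j in range(n) if j != moved]
    with ThreadPoolExecutor(max_workers=k) as pool:
        partials = list(pool.map(lambda c: slave_sum(pair_delta, c),
                                 partition(others, k)))
    return sum(partials)


# Toy check: if the "energy change" for partner j is simply j, the total is 1+...+25.
print(master_energy_change(lambda j: float(j), n=26, moved=0, k=5))  # 325.0
```

The final `sum(partials)` is the serial accumulation step; it and the wait for the slowest slave are where the synchronization cost mentioned in the text would appear.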


Our initial results have been confined to a small number
of interacting atoms and do not use the periodic boundary
conditions mentioned above, appropriate to an infinite
system. As a simple example, a run with 26 atoms and 25
processors (the master is its own slave) gives a speedup of
18-20 for unoptimized versions of the program. One of the
limiting factors in this sample computation involves having
to add up the interaction energies calculated on separate
processors. No number of processors K can reduce this
$O(N)$ operation to better than $O(\log_2 N)$. While the speedup
obtained in this sample calculation appears quite
reasonable, it is somewhat misleading. The present version
of the program uses software double precision floating
point routines. With double precision floating point
hardware, the amount of local computation relative to
global communication would be considerably reduced, and
the speedup that could be obtained would be considerably
smaller. One of the problems in using Cm* to investigate
numerical algorithms is the poor floating point performance
of the LSI-11 processors and the consequent need to
extrapolate present results to those for a hypothetical
multiprocessor with a better floating point capability. From
a detailed examination of the Metropolis algorithm,
however, it appears that many of the difficulties with the
present program can be solved, particularly when the
number of molecules N increases relative to the number of
processors K. We are relatively confident that a
multiprocessor architecture such as that of Cm* can
provide an efficient solution to Monte Carlo calculations of
the structure of liquids.
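
The logarithmic bound on adding up the partial energies can be illustrated with a pairwise (tree) summation. This is a generic sketch rather than the scheme used on Cm*; the function name is hypothetical, and each pass of the loop stands for one round in which all pairs are added simultaneously.

```python
def tree_sum(values):
    """Add values pairwise; return (total, number of parallel rounds)."""
    level = list(values)
    if not level:
        return 0, 0
    rounds = 0
    while len(level) > 1:
        # All additions within one round are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


print(tree_sum(range(1, 26)))  # (325, 5)
```

With 25 partial sums the reduction takes 5 rounds, the ceiling of log2(25), instead of 24 sequential additions.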


The present program is written in Bliss-11 and runs on
the bare Cm* hardware. It is being converted to run on top 
of the Medusa [7] operating system. We are also extending 
the current programs to include molecular interactions and 
periodic boundary conditions, for a system of 256 water 
molecules using all the 50 processors of Cm*.

AN ORGANIZATION OF A THREE-DIMENSIONAL ACCESS MEMORY
SUMMARY


Multidimensional access memory was proposed by 
[1]. In this paper a three-dimensional access 
method will be discussed in detail. 

Consider the array given by $w(i,j,k)$, $i,j \in \Omega_N$,
$k \in \Omega_{LN}$, where $\Omega_N = \{0,1,2,\ldots,N-1\}$,
$\Omega_{LN} = \{0,1,2,\ldots,LN-1\}$, $N = 2^n$, $L = 2^l$ and $L < N$.
Let $k = k' + k''L$, $k' \in \Omega_L$, $k'' \in \Omega_N$, where
$\Omega_L = \{0,1,2,\ldots,L-1\}$.
As might be expected, $L$ bits are used to represent
each data item in the three-dimensional N X N X N array.

We consider two kinds of modes to access the
array $w(i,j,k)$. Let us call them mode 1 and mode 2.

mode 1
i-slice; $\{w(\alpha,j,k);\ \alpha \in \Omega_N\}$, for some $j$ and $k$
j-slice; $\{w(i,\alpha,k);\ \alpha \in \Omega_N\}$, for some $i$ and $k$
k'-slice; $\{w(i,j,\alpha + k''/(N/L) \cdot N);\ \alpha \in \Omega_N\}$, for
some $i$, $j$ and $k''$ which satisfies $k''//(N/L) = 0$
k"-skips; $\{w(i,j,k' + \alpha L);\ \alpha \in \Omega_N\}$, for some $i$, $j$
and $k'$

mode 2
i*k'-slots; $\{w((i+\alpha)//N,\ j,\ \beta + k'' \cdot L);\ (\alpha,\beta) \in \Omega_{N/L} \times \Omega_L\}$,
for some $i$, $j$ and $k''$
j*k'-slots; $\{w(i,\ (j+\alpha)//N,\ \beta + k'' \cdot L);\ (\alpha,\beta) \in \Omega_{N/L} \times \Omega_L\}$,
for some $i$, $j$ and $k''$
k"*k'-slots; $\{w(i,\ j,\ \beta + ((k''+\alpha)//N) \cdot L);\ (\alpha,\beta) \in \Omega_{N/L} \times \Omega_L\}$,
for some $i$, $j$ and $k''$

where $\Omega_{N/L} = \{0,1,2,\ldots,N/L-1\}$, and / and // represent,
respectively, a quotient and a nonnegative
remainder after integer division. Two access modes
are illustrated in Fig. 1.
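Let $i$, $j$, $k$ and $m \in \Omega_N$ be represented in
binary form, $i = (i_{n-1}, i_{n-2}, \ldots, i_0)$,
$j = (j_{n-1}, j_{n-2}, \ldots, j_0)$, $k = (k_{n-1}, k_{n-2}, \ldots, k_0)$ and
$m = (m_{n-1}, m_{n-2}, \ldots, m_0)$. Define the function $\mathcal{L}(i,j,k,m)$,

$$\mathcal{L}(i,j,k,m) = (i_{n-1} \oplus j_{n-1} \oplus k_{n-1} \oplus m_{n-1},\ i_{n-2} \oplus j_{n-2} \oplus k_{n-2} \oplus m_{n-2},\ \ldots,\ i_0 \oplus j_0 \oplus k_0 \oplus m_0)$$

where $\oplus$ represents Exclusive OR.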

A q*r shuffle $S_{q*r}$ is defined as a mapping [2],

$$S_{q*r}(i) = (qi + i/r)\,//\,qr, \qquad 0 \le i \le qr-1.$$

We define three functions as follows,


F(a) = LOS erg pz, (CECE) Hf N) 585 sey (5) 2 O// L, 
Siansi oe)? 

G(a) = LS) eg, YD Sy ey 6 Gt0/L) N),a/ L, 
Stan/i(e)? 

H(a) = LS) ent, Do Spay yD 4 2, 
Span /p CK "t0/L) ND) ae Q. 


CH1569-3/80/0000-0137$00.75 © 1980 IEEE


Theorem 3; The inverse permutations of O,> 0. 


Theorem 1; If $\sigma_{(p,q,r)}$, $p,q,r \in \Omega_N$, can be expressed
by the description,

$$\sigma_{(p,q,r)} = \begin{pmatrix} 0 & \cdots & N-1 \\ \mathcal{L}(p,q,r,0) & \cdots & \mathcal{L}(p,q,r,N-1) \end{pmatrix}$$

then $\sigma_{(p,q,r)}$ is the permutation on the set $\Omega_N$,
and can be realized by an n-stage shuffle-exchange
network [3].
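
The permutation property in this theorem can be checked numerically if the function in the bottom row is read as the bitwise Exclusive OR of its four arguments. That reading is an assumption based on the definition given earlier, and the names below are hypothetical.

```python
def xor4(p, q, r, m):
    """Bitwise Exclusive OR of four n-bit indices."""
    return p ^ q ^ r ^ m


def sigma(p, q, r, N):
    """Bottom row of the two-row description: the image of each m in 0..N-1."""
    return [xor4(p, q, r, m) for m in range(N)]


N = 8  # N = 2**n with n = 3
perm = sigma(1, 2, 4, N)
print(perm)                               # [7, 6, 5, 4, 3, 2, 1, 0]
print(sorted(perm) == list(range(N)))     # True: every index appears once
print(all(perm[perm[m]] == m for m in range(N)))  # True: the map is its own inverse
```

Because XOR with a fixed value is self-inverse, each such map is a permutation of the N indices whenever N is a power of two and p, q, r are below N.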


The permutations required for access mode 1 
are defined as follows. 

regi! * Ca anyL) 

aaa! Gage”) 

Oak yk" OCG, kt +N/L,k"/ (N/L) « (N/L)) 
jsk'yk" ~ °C5,k'eN/L,k"/(N/L) © (N/L)) 
It can be shown that é 1 = 

i,j,k 
so = 0 < = 
Leigh” Lek ke ok jekegk 
cee k! Kt Control signals required in the realiza- 
> > 


tion of the o( 
stage. 


O 


-1 
eh eo 


re) and oO 
sky 


are m and are uniform in each 


P»q>V) 


Theorem 2; If Os. or and 0, can be expressed by 


k 
the description, respectively, 
| 0 we. Nel 
Ff) ... FU e1) 
0 eee N-1 
om 1 1 
J {e"(0) ... G (N-1) 
ait O- ~- ae. Net 
Le) io hoe) 


Then 0,, 0. and oO 
1 J 


CO. 
1 


k 


, are the permutations on Ras and 


can be realized by an n-stage shuffle-exchange 
network. 


and 
J 


sf ; 
and 0, , respectively, can be 


me k 
realized by an n-stage shuffle-exchange network. 


; 0 
’ 


denoted os ae 
L J 


It should be noted that in general 0, # 0; j 
Control signals required in 
the realization of O,, ie 


are (Z+1)-°N/L-1. 


-1 -1 
7 o, and OL # OL. 


-1 
O. and O. 


b] es Fa 
Three-dimensional access memory is physically
implemented by N RAM chips, each organized as a
one bit by $N^2L$-word memory. A cell location is
denoted by $m(I,J)$, $I \in \Omega_N$, $J \in \Omega_{N^2L}$, where
$\Omega_{N^2L} = \{0,1,2,\ldots,N^2L-1\}$.
A WRITE operation to the three-dimensional
access memory is performed by applying a permutation
$\sigma$ on the set of data indices and by addressing
each memory chip. Conversely, a READ operation is
performed by addressing each memory chip and
applying an inverse permutation $\sigma^{-1}$ on the set of
chip numbers $I$.


Storage schemes for mode 1 and mode 2 are 
given as follows. 
mode 1; A memory cell $m(I,J)$ contains a
three-dimensional array entry
$w(\mathcal{L}(I,\ J//N,\ ((J/N)//L) \cdot (N/L),\ J/NL),\ J//N,\ J/N)$,
$I \in \Omega_N$, $J \in \Omega_{N^2L}$. Conversely, a
three-dimensional array entry $w(i,j,k)$ is stored
in a memory cell $m(\mathcal{L}(i,j,k' \cdot N/L,k''),\ j+kN)$, $i,j,k'' \in \Omega_N$,
$k' \in \Omega_L$, $k \in \Omega_{LN}$.
mode 2; A memory cell m(1I,J) contains a three-di- 
mensional array entry W(Sy pay EO IM N,(J/N)/ L, 


J/NL)) 5 Sy 5.47, FN), (J/N) // L+Sy pap, (J/NL) *N) I € 


One Je Qua: Conversely, a three-dimensional ar- 

ray entry w(i,j,k) is stored in a memory cell 

Sy sey pp, LCE do Soy 7 seg, OR) 9D) 85 sey pq, GD HK 

Stan /1 6&2 NL) i,j,k" e Qs k' ¢€ Qs k € Qo: 
Addresses $J$, permutations $\sigma$ and inverse
permutations $\sigma^{-1}$ for each access mode are summarized in
Table 1.
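
The purpose of the mode 1 storage scheme can be illustrated with a small check that each slice lands on N distinct chips, so that all N bits can be fetched in one memory cycle. The sketch assumes the placement reads as chip I = i XOR j XOR (k' N/L) XOR k" and address J = j + kN, with k = k' + k"L; this reading of the damaged formula, the small sizes N = 8 and L = 2, and all names are assumptions.

```python
def place(i, j, k, N, L):
    """Map array entry w(i, j, k) to (chip I, address J) under the assumed mode 1 scheme."""
    k_low, k_high = k % L, k // L  # k = k' + k''L
    chip = i ^ j ^ (k_low * (N // L)) ^ k_high
    return chip, j + k * N


def conflict_free(entries, N, L):
    """True if the N entries of a slice fall on N distinct chips."""
    chips = {place(i, j, k, N, L)[0] for (i, j, k) in entries}
    return len(entries) == N and len(chips) == N


N, L = 8, 2
i, j, k = 3, 5, 6
k_low, k_high = k % L, k // L
slices = {
    "i-slice": [(a, j, k) for a in range(N)],
    "j-slice": [(i, a, k) for a in range(N)],
    "k'-slice": [(i, j, a + (k_high // (N // L)) * N) for a in range(N)],
    'k"-skips': [(i, j, k_low + a * L) for a in range(N)],
}
for name, entries in slices.items():
    print(name, conflict_free(entries, N, L))
```

Under this reading all four mode 1 access patterns are conflict-free, and distinct array entries never share a cell.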

Addressing circuitry is realized by Exclusive 
OR gates for mode 1. However, N/L n-bit adders are 
required for mode 2. Rewriting address J for memo- 
ry chip I in a more convenient form, we get for 


i*k'-slots, 
nt21 ,. 


J = k"// (N/L) «2 k"/ (N/L) 02 T8425 / (N/E), 
I/ L, [ith(j// (N/L),1/L,N/L-1,k"/ (N/L))] 
/(N/L) ,k"/ (N/L) }+2745 If (N/E) «2°45 / (W/L) 

. sue, a ae eae Peace = ARE, satelite 
The coetticient of Z shows that N/L n-pl1lt adders 
can determine the address. Similar arguments can 
be applied for both j*k'-slots and k"*k'-slots. 

Now, we consider a block-oriented access of 
slots, i.e., i// (N/L)=0 for i*k'-slots, j// (N/L)=0 
for j*k'-slots and k"// (N/L)=0 for k"*k'-slots. In 
this case, the coefficient of 2” is L(4/CN/L), 

T// L,i/(N/L),k"/(N/L)). Thus, addressing circuitry 
can be implemented using only AND and Exclusive OR 
gates. Moreover, control signals of the shuffle- 
exchange network in mode 2 become v and are uni- 
form in each stage as well as in mode 1. 

There exists an algorithm of data exchange in 
the memory to switch from one access mode to the 
other mode. Data exchange is performed through 
memory-to-memory data path with data permutation 
network. Necessary permutations are implemented by 
an n-stage shuffle-exchange network and a wired 
shuffle network connected in cascade. 

The algorithm may be solved in O(3N*L) steps,
where a step is the total time of a fetch cycle,
a propagation through the data permutation
network, and a store cycle.
As a result, a technique for organization of a
three-dimensional access memory is given.

Summary


This paper deals with compile-time exposure of 
parallelism in high level sequential programs for 
eventual execution on a data flow computer. 
Specifically, it deals with the analysis and
restructuring of loops which are written for 
sequential execution, transforming them into loops 
whose iterations may be conceptually executed in 
parallel. The technique presented in this paper 
differs from other techniques in that outer loops 
are examined, even if some of the inner loops need 
to be executed sequentially. There are certain 
types of parallel computers (e.g., data flow 
computers) on which the parallel execution of 
outer loops may yield significant reduction in 
execution time even though some inner loops are 
executed sequentially. 


A data flow machine [2,2,5,6] is a highly 
parallel, asynchronous computer. The assumption 
underlying a data flow computer is that a program 
is not a sequence of instructions that cause 
changes to a memory space, but instead a program 
is a collection of computations related to each 
other by the need for data values that are 
produced and consumed. The order of execution of 
the computations is not directly stated by the 
program but rather by the partial ordering 
provided by the data dependencies. The derivation 
of this partial ordering is detailed in [?].
Therefore, transforming sequential loops to 
parallel loops, whose iterations are independent, 
may greatly reduce the execution time of a program 
executing on a data flow machine. 


Loop decomposition has been proposed as a
technique which attempts to decompose the body of
a loop into several smaller loops while
maintaining the data dependencies between the
statements. In the loop decomposition technique
by Lo [4], the entire loop is initially analyzed
to see if the iterations are independent. If they
are independent, the loop is directly transformed
into a parallel loop. If they are not
independent, the loop is examined to see if the
iterations can be made independent through forward
substitution or saving of values in a temporary
array. If the loop cannot be transformed into a
parallel loop using these transformations, the
loop is then decomposed into smaller loops. Each
of these smaller loops is analyzed to see if its
iterations are independent or if the iterations
can be made independent through the use of the
above transformations.


* Research reported herein was supported in part 
by the National Science Foundation under Grant 
NCS77-02467 


CH1569-3/80/0000-0139$00.75 © 1980 IEEE


to 




Data Flow Languages* 


Arthur E. Oldehoeft 
Department of Computer Science 
Iowa State University 
Ames, IA 50011 


Loop decomposition techniques are applied to
the innermost loops first. If an inner loop
cannot be transformed into a parallel type loop,
no attempt is made to transform the enclosing
loops. The reason for this is that parallel machines
typically do not take advantage of the parallelism
available in the outer loops if some of the inner
loops are executed sequentially. In a data flow
environment this is important because parallel
execution of the outer loops may yield significant
reductions in execution time even though the inner
loops are performed sequentially. For this
reason, all loops are analyzed regardless of the
type of statements that appear in the body of the
loop.

The algorithm discussed below extends in two
ways the method introduced by Lo. First, the
requirement that an array name appear on the left
side of an assignment statement only once in the
body of a loop has been eliminated. This
facilitates the transformation of
non-single-assignment high level sequential languages
to a data flow language. Second, the requirement
that the body of the loop consist of only
assignment statements has also been eliminated.
Any type of statement, including compound
statements, may appear in the body of the loop.


A brief general description of the algorithm
is given here. There are two matrices associated
with the algorithm called "order" and "try". The
"order" matrix contains a row and a column for
each statement in the body of the loop. Entries
in the "order" matrix indicate that an ordering
relation exists between two statements in the body
of the loop. A "t" in order(i,j) indicates that
statement i must be executed before statement j
because of data dependencies or interference in
the usage of storage caused by parallel execution
of the loop. The data dependencies may occur
across iterations of the loop. Each time an entry
appears in the "order" matrix, a corresponding
entry appears in the "try" matrix. An entry in
the "try" matrix has a list of transformations
which are applicable in breaking a cycle in which
the two statements might appear. A cycle
indicates that the statements in the cycle have
data dependencies on each other. The different
transformations used in restructuring a loop for
parallel execution are forward substitution of
expressions, saving values in a temporary array,
or changing a scalar value into an array value.
If none of these transformations is applicable in
breaking a particular cycle, an indication of this
is placed in the "try" matrix. Both of these
matrices are constructed at compile time during
the analysis of the body of the loop.


The compile-time loop analysis proceeds as
follows. All statements in the body of the loop
are analyzed to determine their relationship with
the other statements in the body of the loop. If
any statement in the body of the loop is a
compound statement, all the statements in the body
of the compound statement must also be analyzed to
determine their relation with the other statements.
The details of the data flow analysis needed to
analyze the data dependencies appear in [?]. A
statement is analyzed in the following manner.
Every value defined by the statement is analyzed
by finding all the uses of the value in the body
of the loop. The uses are found in a list which
is associated with each value defined by the
statement. Each use is compared with its
definition in the loop to find its relation. If
the definition in the loop is prior to its use in
the same iteration, it is possible to use the
forward substitution transformation to break a
cycle in which the statements might be contained.
If a use appears prior to its definition in the
same iteration, or a previous iteration, it is
possible to save the old values in a temporary
array to break a cycle
in which the statements are involved. These facts
are noted in the "try" matrix. If a scalar value
is assigned, a note is made in the "try" matrix
indicating that the scalar value must be changed
to an array value if the statement is to appear in
the body of a forall construct. All values
defined by a given statement are analyzed, as
described above, and the "order" and "try"
matrices are formed.


Once the "order" and "try" matrices have been
formed by the data flow analysis routine, the loop
decomposition algorithm proceeds in the following
manner. The "order" matrix is analyzed for
cycles. If no cycles are found, the iterations of
the loop are independent and the loop may be
transformed directly into a forall construct. If
there are cycles, an attempt is made to break the
cycles using the transformations mentioned above.
If the attempt is successful, the loop is
transformed into a forall construct. If the
attempt is unsuccessful, the loop is decomposed
into minor loops. A minor loop contains either a
cycle or a single statement. Each minor loop that
contains only a single statement can be
transformed into a forall construct as long as the
single statement does not have a data dependency
on itself. If the single statement has a
recursive data dependency, it must be executed as
a sequential loop. Each minor loop which contains
a cycle is analyzed to see if the cycle can be
broken by the transformations noted in the "try"
matrix. If the saving of values in a temporary
array or forward substitution techniques break the
cycle, the minor loop is transformed into a forall
construct. If not, the minor loop must be
executed sequentially. If any scalar values are
assigned in a loop which has been transformed to a
forall construct, the scalar value must be changed
to an array value.
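
The cycle analysis of the "order" matrix and the split into minor loops can be sketched as follows. This is an illustration, not the paper's algorithm: it represents the matrix as a list of boolean rows, uses Warshall's transitive closure to find cycles (a swapped-in standard technique), and leaves out the "try" matrix and the ordering of the resulting minor loops. All names are hypothetical.

```python
def transitive_closure(order):
    """reach[i][j] is True if statement i must precede j directly or indirectly."""
    n = len(order)
    reach = [row[:] for row in order]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def minor_loops(order):
    """Group statements into minor loops: each is a cycle or a single statement.

    Returns a list of (statements, has_cycle) pairs; has_cycle is also True
    for a single statement that depends on itself.
    """
    n = len(order)
    reach = transitive_closure(order)
    seen, loops = set(), []
    for i in range(n):
        if i in seen:
            continue
        group = [i] + [j for j in range(i + 1, n) if reach[i][j] and reach[j][i]]
        seen.update(group)
        loops.append((group, reach[i][i]))
    return loops


# Statements 0 and 1 depend on each other; statement 2 is independent.
order = [
    [False, True, False],
    [True, False, False],
    [False, False, False],
]
print(minor_loops(order))  # [([0, 1], True), ([2], False)]
```

A minor loop reported with `has_cycle` false could be turned into a forall construct directly; one reported true would first need a transformation from the "try" matrix or else sequential execution.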


Consider the program segment in Figure 1, which
multiplies two matrices, b and c, and produces a
matrix a. Assume that the array a is l x n, the
array b is l x m, and the array c is m x n.

    do i = 1 to l
      do j = 1 to n
        a(i,j) := 0
        do k = 1 to m
          a(i,j) := a(i,j) + b(i,k) * c(k,j)
        end
      end
    end


Figure 1 Matrix multiplication


When this program segment is analyzed using the
technique given above, the innermost loop is found
to require sequential execution, but the
outer two loops may be transformed into forall
constructs. This gives the program
segment in Figure 2.


    forall i in (1,l) do
      forall j in (1,n) do
        a(i,j) := 0
        do k = 1 to m
          a(i,j) := a(i,j) + b(i,k) * c(k,j)
        end
      end
    end


Figure 2 Transformed matrix multiplication
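
The structure of Figure 2 can be mimicked in a conventional language by treating every (i, j) pair as an independent unit of work while keeping the k loop sequential. This is only an analogy: a thread pool stands in for the forall construct of a data flow machine, zero-based indices replace the one-based indices of the figure, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def cell(b, c, i, j):
    """Sequential inner loop over k for one (i, j) pair."""
    total = 0
    for k in range(len(c)):
        total += b[i][k] * c[k][j]
    return total


def forall_matmul(b, c, workers=4):
    """Compute a = b * c with the two outer loops expressed as independent tasks."""
    l, n = len(b), len(c[0])
    pairs = list(product(range(l), range(n)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(lambda ij: cell(b, c, *ij), pairs))
    a = [[0] * n for _ in range(l)]
    for (i, j), value in zip(pairs, values):
        a[i][j] = value
    return a


print(forall_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

No task reads a value written by another, which is the independence property the loop analysis establishes for the two outer loops.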


The resulting speedup of the transformed loop
depends on the manner in which the forall
construct is implemented. Assuming the index
values in a forall construct are generated
sequentially, it is possible for the program in
Figure 2 to be executed in O(l+m+n) time. As it
appears in Figure 1, the data dependencies are not
known so that the code generated results in
O(l*m*n) execution time.

MICROCOMPUTER ARRAY PROCESSOR SYSTEM
Goodyear Aerospace Corporation


Akron, Ohio 

Goodyear Aerospace Corporation's 
Microcomputer Array Processor System [1] 
is a programmable multiprocessor com- 
puter system designed for Electronic 
Warfare applications for the Air Force 
Avionics Laboratory (AFAL). The 
applications involved sorting, identi- 
fying and tracking emitter signals in 
real time for very dense radar environ- 
ments. The main problem in achieving 
this goal is that the signal densities 
constitute a severe data processing 
load which greatly exceeds the capa- 
bility of present airborne computer 
systems. 


The architecture of this system 
(Figure 1) retains many of the classic 
multiprocessor design concepts including 
a master-slave relationship among its 
microprocessors in a tightly coupled 
structure. Each processor is a 32-bit 
programmable computer with its own 
dedicated memory and a capability to 
execute approximately four million in- 
structions per second. Each processor 
can communicate with several banks of 
common memory (referred to as global 
memory). The global memory modules and 
their communication structure tie the 
individual processors together in a 
symmetrical multiprocessor computer 
architecture. The multiprocessor system 
is modular and can contain at least two 
and at most eight processors coupled 
with up to sixteen banks of global 
memory and executes up to 32 million 
instructions per second. Expansions 
beyond these limits are possible if 
every processor does not have to have 
access to every global memory module. 
Currently, a four processor system (with 
three banks of global memory) is in- 
stalled at Wright Patterson AFB for use 
by AFAL. This system will be expanded 
to six processors during 1980. This 
multiprocessor subsystem occupies approx- 
imately 1.6 cu. ft. and consumes under 
400 watts. 


Global memory is implemented as 
several independent memory banks to allow 
simultaneous accesses (i.e., one concurrent 
access per bank). Each memory 
bank contains at least 1024 32-bit words 
and can be accessed by each processor. 
Global memory can be used to store 
data common to several processors, to 
swap programs with or add subprograms to 
those in local memory of any processor, 
and to facilitate communication between 
processors and/or other subsystems. 

This last use can be accomplished by 
message switching techniques and may be 
initiated via software polling or via 
hardware driver interrupts. 


Conflicts between microprocessors 
in accessing global memory are generally 
minimal in the current application for 
three reasons. First, each micro- 
processor has its own dedicated memory 
which contains its program instructions 
and local variables. Next, in the 
current application we can predict 
relative accesses due to various para- 
meters so that algorithms were chosen 
which distributed global memory accesses 
uniformly. Finally, each microprocessor 
executed many more computational in- 
structions than global memory accesses. 


Figure 1. Architecture of the Microcomputer 
Array Processor System 
[Block diagram; recoverable labels: GLOBAL 
MEMORY, MEMORY REQUEST LOGIC, CONTROLLER] 


The memory request logic connects 
the global memory banks to each 
processor (and peripheral device) in a 
manner which will support as wide a 
communication bandwidth as possible 
without requiring an unreasonable amount 
of hardware. The logic structure also 
allows expansion in the number of 
processors and/or memory banks. These 
features led to the multi-port multi-bus 
communication structure. A port struc- 
ture and a port controller are attached 
to each memory bank. Any processor (or 
device) which is to communicate with a 
given memory bank must be connected to 
a port associated with the bank. A 
microprocessor initiates a global memory 
access by issuing a request over its 
output bus. Each port determines if the 
request belongs to the address space of 
its memory bank and hence, only the 
proper port will accept the request. 
If the requested memory bank is not 
currently busy the request will be 
serviced immediately. Otherwise, the 
request is held in the port. When the 
memory becomes available all requests 
held (a maximum of one per processor) are 
queued into the memory port controller 
and serviced on a priority basis. Each 
request requires approximately 200 ns to 
be serviced. It takes a minimum of 750 
ns for a processor to make a request. 
Thus, if the average rate of accessing 
for any given global memory bank is less 
than three requests per 750 ns period 
the memory access structure is transparent 
to the processor and no time is 
lost by the processor. 
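
The three-requests figure follows from the two timings quoted above. A minimal Python sketch of the arithmetic is given here as an illustration only; the function names are hypothetical and the model ignores queueing and priority effects.

```python
SERVICE_NS = 200           # approximate service time per request (from the text)
REQUEST_INTERVAL_NS = 750  # minimum time for a processor to issue a request (from the text)


def requests_per_interval(service_ns=SERVICE_NS, interval_ns=REQUEST_INTERVAL_NS):
    """Whole requests one memory bank can service within one request interval."""
    return interval_ns // service_ns


def bank_utilization(requests, service_ns=SERVICE_NS, interval_ns=REQUEST_INTERVAL_NS):
    """Fraction of the interval the bank is busy; above 1.0 some processor must wait."""
    return requests * service_ns / interval_ns
```

With these numbers a bank can absorb three requests in 750 ns (600 ns of service time), while a fourth request would push the required service time to 800 ns and force a wait.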


Each of the processors contains a 
CPU, program memory, a microprogram 
sequencer, a pipeline register, a condi- 
tion decoder, clock and timer and an 
interrupt control. The CPU contains 
sixteen addressable registers and an 
arithmetic logic unit. The program 
memory contains microcode for the 
program and local data. The micro- 
program sequencer causes program memory 
to sequence through its microcode in 
proper order. The pipeline register 
holds the current instruction being 
executed so that program memory may be 
released to fetch the next instruction. 
The condition decoder is used to 
facilitate conditional branching. The 
clocking and timing unit allows each 
instruction type to be executed at the 
fastest rate possible. The interrupt 
control allows the processor to respond 
to asynchronous external stimulus with- 
out resorting to polling. 


The architecture of the individual 
processors is centered around utilization 
of LSI bipolar bit slice technology 
rather than the MOS "computer on a chip" 
which is more commonly associated with 
the term microprocessor. Each processor 
is composed of eight 4-bit CPU chips 
which are cascaded together to form a 
high speed 32-bit processor. The 
resulting processor is capable of 
executing register-to-register operations 
(i.e., adds, subtracts and 
logical operations) on 32-bit data words 
in under 300 ns. 


Bench testing of the system has 
shown that multiprocessor based systems 
are a practical solution to the application 
problem. During bench testing two 
observations were made. First, 
prolonged operation of this system has 
clearly demonstrated that (for this 
application at least) very high subsystem 
interaction rates can be 
supported in a cost effective manner 
through proper hardware design. Second, 
for dedicated use in a master/slave mode, 
maintaining control and coordination among 
the various processors in this application is 
not an overly difficult task. 
SUMMARY 
Probably the most obvious and intuitive 
property of concurrency (or parallelism) is 
the simultaneity of events. Although 
different formal models of parallelism have been 
proposed which model different aspects of 
parallelism and synchronization ([1]), 
simultaneity of events has not been directly 
represented in any of those models. Rather it 
has always been represented by interleaving 
distinct events into sequences and it has been 
analyzed by studying the properties of the set of 
all such sequences. Thus, in Petri nets we have 
"firing-sequences", which are total orderings of 
occurrences of events. In this context, two events 
a and b are "simultaneous" if xaby and xbay are 
possible sequences of events in the modelled 
system, for some sequences x and y. It is very 
simple and convenient to represent simultaneity in 
terms of sequences with implicit interleaving. 
However, as was shown by Miller and Yap ([4]), 
interleaving alone is a weaker notion than 
simultaneity and only under certain conditions 
can simultaneity be represented by a form of 
interleaving. It should also be pointed out 
that this inability to model simultaneity of 
events exactly is not peculiar to ordinary Petri 
nets, but is evident in all their previous 
extensions as well. 

In this work we have developed a new version 
of Petri nets, called "Timed Petri nets", which 
directly represents simultaneity of events. We are 
mainly interested in the effects of this extension 
(i.e. modelling of simultaneity) on the complexity 
of the formal properties which can be tested on 
the nets and which are useful for system analysis. 

A Timed Petri net (TPN, for short) is 
defined as a pair (PN, Tm), where: 

- $PN = (P, T, I, O, M_0)$ is a generalized Petri net 
(see [2]) such that every transition has at 
least one input place. 

- $Tm : (Z_0)^n \times T \to N$, where $N = Z_0 - \{0\}$, $Z_0$ 
is the set of nonnegative integers and $n = |P|$ 
(the cardinality of set $P$). $Tm$ is the 
"firing time function". 

This definition implies that a TPN has the 
same structure as a generalized Petri net. 
However, it has a different flow of tokens, as we 
will see later on. 

The function Tm assigns to each transition t 
in T a "firing time", the time interval that the 
transition t takes to fire. The firing time of 
each transition is also a function of the current 
marking of the net. 


This research was supported by INPE and CNPq 
Let $\tau$ be a nonnegative real number. We 
denote by $M(\tau)$ the marking of the net at time 
instant $\tau$. Then $M(\tau, p_i)$ denotes the number of 
tokens in the i'th place $p_i$ at time instant $\tau$. As 
usual, $M(\tau)$ can be represented by a vector where 
the i'th component, $M(\tau)(i)$, is given by $M(\tau, p_i)$. 
By definition, $M(0) = M_0$. 

A few attempts have been reported at 
introducing time as a new parameter in Petri nets, 
but only to permit the analysis of system 
performance. We had a different objective and we 
have followed a distinct approach. Roughly 
speaking, we characterize parallelism of events by 
the firing of transitions in a TPN under two very 
intuitive notions: 

- events can occur simultaneously (or in 
parallel) only if they are independent. The 
characterization of independence among events 
(or firing of transitions) at each control 
state (or marking of a TPN) is crucial 
throughout the whole paper. 

- only independent events can occur 
simultaneously. Time was introduced in TPN by 
associating a firing time to each transition 
in the net only in order to support a 
primitive notion of simultaneity. 

Under that approach, a transition t is said to 
be enabled at a given marking M at a time instant 
$\tau$ iff it is enabled in the usual sense defined in 
Petri nets (i.e., $M(\tau) \geq I(t)$) and t is not firing 
at this time instant. However, only certain 
subsets of the set $E[M(\tau)]$ of enabled transitions 
at $M(\tau)$ are "simultaneously firable". This is 
defined by an independence relation on the power 
set of $E[M(\tau)]$, called the "Simultaneity Relation" 
$Sy[M(\tau)]$. If A and B belong to the power set of 
$E[M(\tau)]$, then: 

- $(A,B) \in Sy$ iff $A \not\subset B$ and $B \not\subset A$ and 
$M(\tau) \geq \sum I(t)$ (where $t \in A \cup B$). 

Sy is a symmetric and irreflexive relation. 

Using the relation $Sy[M(\tau)]$ a family of sets, 
called $S[M(\tau)]$, is defined such that each element 
$SF_i$ of the set S at marking $M(\tau)$ is a set of 
simultaneously firable transitions. For every set 
$SF_i \in S$ the following conditions are satisfied: 

a) $|SF_i| = 1$ or $\forall B, C \in SF_i$ s.t. $B \not\subset C$ and $C \not\subset B$ then 
$(B,C) \in Sy$. 

b) $\forall B \not\subset SF_i$ then $(SF_i, B) \notin Sy$. 

S is a cover of the set $E[M(\tau)]$ and all the 
transitions in one of these sets $SF_i$ can initiate 
their firing at the same time instant $\tau$. The 
choice of which set $SF_i \in S$ will initiate firing at 
time instant $\tau$ is arbitrary. However, once a set 
is selected for firing, every transition t in 
this set will initiate its firing at time instant 
$\tau$ by removing I(p,t) tokens from each input place 
p and will be "firing" until time instant 
$Tm(t) + \tau$, when it deposits O(t,p) tokens in each 
output place p. 
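
The firing rule just described can be sketched in Python. This is an illustrative model under stated assumptions, not the authors' formalization: each transition is a hypothetical dictionary with `inputs`, `outputs` and a constant `time` (the text allows the firing time to depend on the marking as well), and the function names are invented for the example.

```python
def can_fire_together(marking, transitions):
    """True if the marking holds enough tokens for all transitions at once."""
    need = {}
    for t in transitions:
        for place, count in t["inputs"].items():
            need[place] = need.get(place, 0) + count
    return all(marking.get(place, 0) >= count for place, count in need.items())


def start_firing(marking, transitions, tau):
    """Remove input tokens at time tau; return new marking and pending completions."""
    if not can_fire_together(marking, transitions):
        raise ValueError("transitions are not simultaneously firable")
    marking = dict(marking)
    pending = []
    for t in transitions:
        for place, count in t["inputs"].items():
            marking[place] = marking.get(place, 0) - count
        pending.append((tau + t["time"], t))  # firing ends at Tm(t) + tau
    return marking, pending


def complete_firings(marking, pending, tau):
    """Deposit output tokens for every firing that has ended by time tau."""
    marking = dict(marking)
    still_firing = []
    for end_time, t in pending:
        if end_time <= tau:
            for place, count in t["outputs"].items():
                marking[place] = marking.get(place, 0) + count
        else:
            still_firing.append((end_time, t))
    return marking, still_firing
```

Two transitions that draw on different input places can be started together, whereas two transitions competing for the same single token fail the `can_fire_together` check and can only be interleaved.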


Condition b in the previous definition 
implies that each set $SF_i$ of the cover S is a 
"maximal" set. Thus a new set of transitions can 
be enabled only at markings defined by the 
termination of transition firings, or the initial 
marking $M_0 = M(0)$. These markings are then called 
"active markings". 

In order to specify the state of a TPN at a 
given time instant $\tau$, we need then to specify the 
marking $M(\tau)$ and the termination times of all 
transitions which are firing at time instant $\tau$. 

If this is done for active markings, the behavior 
of a TPN can be completely described. 

An "instantaneous description" of a TPN at 
time instant $\tau$ is a pair $(M(\tau), r)$, where 

- $M(\tau)$ is an active marking. 

- r is a vector such that r(i) is the remaining 
firing time of transition $t_i$, defined at time 
instant $\tau$. 

Under these firing rules it can be shown that 
a TPN is able to represent non-monotone predicates 
on the set of its markings, contrary to ordinary 
Petri nets that only represent monotone predicates 
([4]). For instance, for the TPN in figure 1, if 
two of its places initially hold one token each, the net 
can be used to test whether or not another place has a token 
at M(0), since this fact determines the final 
marking M(2). However, if the firing rules of 
ordinary Petri nets are followed, it can be 
easily seen that this test cannot be made on 
Petri nets. 

Using the net of figure 1, it can be shown 
that TPN can simulate any 2-counters automaton 
and, therefore, Turing Machines. It is also 
possible to show that the simultaneity of events 
(or firing of transitions) is by itself a 
sufficient condition for simulation of 2-counters 
automata by TPN's. 

As TPN's retain the structure of generalized 
Petri nets, they can represent all the problems of 
parallelism and synchronization modelled by these 
simpler nets, and also directly represent the 
simultaneity of events in real systems. It can be 
shown that TPN's have also enough features to 
limit, when necessary, the number of simultaneous 
firing of transitions. In the extreme situation 
only interleaving can be allowed. In this case 
the simultaneity relation Sy is empty and each 
element of the set S, i.e., a set of 
simultaneously firable transitions, is a singleton 
set. Therefore, under our formalization of TPN, 
interleaving can be seen as a degenerate case of 
simultaneity of events. 
The direct modelling of simultaneity of 
events by TPN's has the effect of increasing the 
complexity of the basic decision problems in TPN. 
Problems like the reachability, boundedness, 
coverability, liveness and persistence problems 
are undecidable. Even for a persistent TPN the 
first four problems remain undecidable. These 
results can be derived by proofs similar to 
that of the simulation of any 2-counters automaton 
by a TPN. 

As was pointed out, TPN's are "Turing-complete" 
(in the sense defined in [1]). 
Previous Petri net extensions, such as Extended 
Petri nets ([1]), Priority Petri nets, and the C-CPM 
and EC-CPM models ([5]), are also Turing-complete. 
However, none of these Petri net 
extensions can model all the characteristics of 
parallelism and synchronization, namely, 
simultaneity, reentrancy and priority. 

For example, TPN models simultaneity of 
events, but cannot represent reentrancy and 
recursivity, which need colored tokens (or 
distinguishable tokens), as shown by Zervos 
([5]). On the other hand, the Petri net 
extensions cited above cannot directly represent 
simultaneity of events, since they only represent 
interleaving of events. Thus we can conclude that 
"Turing-completeness" does not express the fact 
that a given model is complete in the sense that 
it can represent all the characteristics of 
parallelism and synchronization. So far, 
parallelism and synchronization, in all their 
distinct aspects, have defied precise modelling. 
More understanding of the fundamental properties 
of parallelism and synchronization has to be 
developed, before we are able to perfectly 
characterize "model completeness". 
Summary 


A new approach to computer architecture is 
suggested by functional programming (FP) systems 
(see Backus [1]). An FP system provides a machine 
language with no variable names, free of 
side effects, executable in a parallel manner 
using data flow techniques, easily translated 
from procedural languages, and straightforward 
to implement. 


In an FP system, objects represent all data 
and data structures. These objects can represent 
any data type. FP functions can be designed for 
the functions or operations of any language. 
Functional forms (functions which use other functions 
as parameters) can model any control flow 
(procedural) aspect of a language. 


Like other data flow systems [2,5] the FP- 
based machine obtains its parallelism directly 
from natural data dependencies among operations 
in a program. 


Five items describe an FP system: a set of 
objects, a set of primitive functions, a set of 
functional forms, a set of function definitions 
called D, and the operation of application. An 
object is either an atom, a sequence of objects, 
or $\bot$ ("bottom" or "undefined"). Atoms include 
numbers and identifiers. The special atom $\phi$ denotes 
the empty sequence. Sequences are represented 
by enclosing the sequence elements in 
< and >. In the FP system of the authors [3] 
the incomplete object is introduced. An incomplete 
object contains portions which have yet to 
be determined, but will be filled in later if 
needed. These objects will be used to represent 
the partial result of a function which has not 
yet completed its execution. Incomplete objects 
are expressed with the incomplete atom $\omega$, the 
fundamental unit of incompleteness, capable of 
assuming any value on completion. An $\omega$ can be 
viewed as a placeholder, representing the result 
of an arbitrary function which has not yet completed 
execution. An $\omega$ resembles a suspension 
[4], except that the function associated with 
the $\omega$ is active instead of suspended. 


Each $\omega$ is associated with a completion function 
which will eventually specify a value to be 
used in place of the $\omega$. Many references to a 
single $\omega$ can exist and replacing an $\omega$ with a 
value may alter many objects. 


When an $\omega$ is a sequence for an append function, 
a new incomplete object, $\Omega$, is created. 
$\Omega$ is the arbitrary subsequence and indicates a 
section of an arbitrary length sequence which 
has not yet been filled in. $\Omega$ is like a suspended 
CDR except that it can occur anywhere in 
a sequence. 


Conceptually, an incomplete object represents 
a set of objects containing all possible values 
the object may assume on completion. A partial 
ordering of incomplete objects can be constructed 
using the containment operation on sets. An in- 
complete object, X, is more complete than another 
Y, if the set of objects associated with X is a 
proper subset of the set associated with Y. 


Arguments require no names since all functions 
have only one argument. Because programs are 
composed only of functions, variable names are 
eliminated. Functions which would normally re- 
quire more than one argument are applied to a 
sequence containing all arguments needed. Exam- 
ples of primitive functions are: 


apply:<f,x>     Apply function f to object x. 
n:x             The nth element of seq. x. 
tl:x            Remove first element of seq. x. 
id:x            Identity. Return x unchanged. 
eq:<x,y>        Test if x & y are equal objects. 
reverse:x       Reverse elements of seq. x. 
distr:<s,x>     Seq. pairs of elements of s,x. 
length:x        The length of a sequence. 
+:<x,y>         Add x and y. 
apndl:<x,seq>   Append x to the left of seq. 


A $\bot$ is produced when a function is applied to 
an improperly formed object. All functions (but 
not necessarily forms) are $\bot$-preserving, returning 
$\bot$ when applied to $\bot$. 


Functional forms use other functions, creating 
expressions involving functions. The principal 
functional forms are: 

$f \circ g$:x               Composition returns f:(g:x). 
[f1,...,fn]:x           Construction, seq. f1:x,f2:x... 
(p $\to$ f; g):x           If p:x is T, f:x, else g:x. 
/f:x                    Insertion of f into seq. x. 
$\alpha$f:<x1,...,xn>      Apply f to all elements of x. 
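
To make the forms above concrete, here is a small Python sketch, offered as an illustration rather than as the authors' machine language. The helper names (`compose`, `construct`, `condition`, `insert`, `apply_to_all`, `transpose`) and the use of `None` for the undefined object are assumptions of the sketch; insertion is written as a right fold, and its result for an empty sequence is simplified to the undefined object.

```python
from functools import reduce

BOTTOM = None  # stand-in for the undefined object (assumption of this sketch)


def compose(f, g):
    """Composition: (f o g):x = f:(g:x)."""
    return lambda x: f(g(x))


def construct(*fs):
    """Construction: [f1,...,fn]:x = <f1:x,...,fn:x>."""
    return lambda x: [f(x) for f in fs]


def condition(p, f, g):
    """Condition: (p -> f; g):x = f:x if p:x is true, else g:x."""
    return lambda x: f(x) if p(x) else g(x)


def insert(f):
    """Insertion as a right fold: /f:<x1,...,xn> = f:<x1, /f:<x2,...,xn>>."""
    def run(xs):
        if not xs:
            return BOTTOM  # simplification: no unit element is modelled
        return reduce(lambda acc, x: f([x, acc]), reversed(xs[:-1]), xs[-1])
    return run


def apply_to_all(f):
    """Apply-to-all: f applied to every element of the sequence."""
    return lambda xs: [f(x) for x in xs]


def add(pair):
    return pair[0] + pair[1]


def mul(pair):
    return pair[0] * pair[1]


def transpose(rows):
    return [list(col) for col in zip(*rows)]


# Inner product built only from forms, with no variable names in its definition.
inner_product = compose(insert(add), compose(apply_to_all(mul), transpose))
```

Every function takes a single argument, so a two-argument operation such as `add` receives a two-element sequence, mirroring the convention described earlier in the text.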


A computer based on an FP system will have 
three basic components: a set of processors, a 
memory, and a READY list. The processors apply 
functions to objects, the memory holds these 
objects, and the READY list (which may reside in 
memory) holds functions waiting to be executed. 


A list element (instruction) contains: 
<function, object, $\omega_r$, D>. 
The function and object describe an application, 
$\omega_r$ ($\omega_{result}$) indicates the atom being completed 
(a function awaiting completion of this instruction), 
and D defines the program being executed. 
All instructions of a particular program will 
have the same D. 


The processors execute elements from the READY 
list with three possible results: if the $\omega_r$ is 
not referenced, the instruction can be discarded; 
the object may be insufficiently complete for 
function execution, in which case the instruction is 
attached to the incomplete atom blocking its execution; 
or the instruction can be executed, the result is 
installed into $\omega_r$, and all functions awaiting 
completion of $\omega_r$ are added to the READY list. 


A processor need not be able to execute all 
functions, but can be specialized for groups of 
functions in the READY list. All intercommuni- 
cation among processors is through memory and 
the READY list. Processors have no state saved 
between instructions. 


Memory contains only objects, which include 
list elements, D's, functions, and incomplete 
atoms. Memory must be managed, allowing new ob- 
jects to be created and removing objects which 
have become garbage. Garbage must be identified 
immediately since processors need to know which 
$\omega_r$'s are unused. 


In the FP computer incomplete objects control 
execution and thus introduce parallelism. Two 
principles govern incomplete objects: all functions 
are completion functions associated with 
an $\omega$, and the function apply will create incomplete 
atoms. Thus forms defined in terms of 
function application generate new $\omega$'s. For example, 
$f \circ g$:x expands to f:(g:x), so that a new atom 
is created to hold the result of g:x. 


Different functions require arguments of different 
degrees of completeness. Those which manipulate 
atoms, like + or -, require a complete 
object. Functions which work with the structure 
of objects often can be executed with an incomplete 
operand. (length:<$\omega_1$,$\omega_2$> can be computed 
without values for $\omega_1$ or $\omega_2$, and 1:<$\omega_1$,x> evaluates 
to $\omega_1$.) In postponing the completion of a 
sequence, the $\bot$-preserving nature of the sequence 
constructor has been lost, and it is natural for 
an FP system which uses incomplete objects to 
have a sequence constructor which is not $\bot$-preserving. 


The basic forms to be implemented include composition, 
construction, apply-to-all, condition 
and insertion. Composition uses an incomplete 
atom to link the functions being composed. When 
<$f \circ g$,x,$\omega_r$,D> is executed, a new incomplete atom, 
$\omega_t$, is created. The function g is started by 
placing the instruction <g,x,$\omega_t$,D> in the READY 
list. The function f, represented by <f,$\omega_t$,$\omega_r$,D>, 
is attached to $\omega_t$. As soon as g puts a result 
into $\omega_t$, the function f will attempt to proceed. 
When $\omega_t$ is replaced by an incomplete object, the 
execution of f and g will overlap if f is able to 
proceed. This will generally be the case when f 
and g are highly composite functions. 
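
The handling of composition through the READY list can be sketched as a tiny single-processor simulation in Python. This is a simplified illustration, not the authors' design: the class and function names (`Atom`, `Compose`, `run`) are hypothetical, an atom is either wholly complete or wholly incomplete (partially complete objects are not modelled), and unreferenced atoms are never discarded.

```python
from collections import deque


class Atom:
    """Incomplete atom: a placeholder filled in when its function completes."""

    def __init__(self):
        self.done = False
        self.value = None
        self.waiting = []  # instructions blocked on this atom


class Compose:
    """Marks the composition form f o g."""

    def __init__(self, f, g):
        self.f, self.g = f, g


def run(function, obj):
    """Execute <function, obj, result> to completion using a READY list."""
    ready = deque()
    result = Atom()
    ready.append((function, obj, result))
    while ready:
        fn, arg, target = ready.popleft()
        if isinstance(fn, Compose):
            link = Atom()                          # new atom linking g to f
            ready.append((fn.g, arg, link))        # start g immediately
            link.waiting.append((fn.f, link, target))  # f waits on the link
            continue
        if isinstance(arg, Atom):
            if not arg.done:                       # operand not complete: block
                arg.waiting.append((fn, arg, target))
                continue
            arg = arg.value
        target.value = fn(arg)                     # install the result
        target.done = True
        ready.extend(target.waiting)               # wake functions awaiting it
        target.waiting.clear()
    return result.value
```

For example, `run(Compose(lambda v: v + 1, lambda v: v * 2), 5)` first schedules the doubling function, then releases the increment once the linking atom is filled.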


The construction and apply-to-all forms are 
similar in that each creates a sequence. Construction 
applies a variety of functions to the 
same object, while apply-to-all applies the same 
function to a variety of objects. In either case 
function evaluations can proceed in parallel due 
to the absence of side effects. Since construction 
brings together multiple arguments for a 
function, parallel argument evaluation results. 
When <[f1,f2,...,fn],x,$\omega_r$,D> is executed, 
<$\omega_1$,$\omega_2$,...,$\omega_n$> is formed and installed into $\omega_r$. Also 
each <fi,x,$\omega_i$,D> is placed on the READY list. The 
apply-to-all form is similar except that the function 
will be the same and the argument will differ 
for each new READY list element. 


A special case of insertion can be executed in a 
parallel manner. Insertion computes a result by 
absorbing each element of a sequence into a dyadic 
function. With associative functions (which 
can be recognized before execution) an insert-associative 
form is used. When the form 
</f,<x1,x2,...,xn>,$\omega_r$,D> is executed, instructions 
are created to cause execution to proceed through 
a binary tree. A function to obtain this tree 
organization can be defined using parallel 
construction. 
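
The binary-tree evaluation order can be sketched in Python. The function name `tree_insert` is hypothetical, and the pairings within one level are computed in a plain loop here; in the machine described above they would be independent instructions on the READY list.

```python
def tree_insert(f, xs):
    """Reduce xs with an associative two-argument f, level by level.

    Returns the result and the number of tree levels, which is the
    number of sequential steps if each level runs fully in parallel.
    """
    if not xs:
        raise ValueError("sequence must be non-empty")
    level = list(xs)
    depth = 0
    while len(level) > 1:
        # Neighbouring pairs are independent of one another.
        nxt = [f(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        depth += 1
    return level[0], depth
```

Only associativity is required, not commutativity, because neighbours are always combined in their original left-to-right order; eight elements need three levels instead of seven sequential applications.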


The conditional form is of special interest in 
parallel processing. To obtain maximum parallelism, 
p:x, f:x, and g:x would be evaluated in 
parallel when (p $\to$ f; g) is executed. The problem 
is that once p:x is evaluated, either f:x or g:x 
must be discarded. When f or g are iterative 
functions (or especially recursive functions), it 
is best to wait for completion of p:x before 
starting f:x or g:x. 


For the non-parallel conditional, a new form, 
choose, is introduced. For <(p $\to$ f; g),x,$\omega_r$,D> the 
atom $\omega_t$ is created and <p,x,$\omega_t$,D> is placed on 
the READY list. <(choose f g x),$\omega_t$,$\omega_r$,D> will be 
attached to $\omega_t$. When p:x is completed choose 
will be activated to select f:x or g:x. 
Summary 


Image processing has long been considered 
an important application area for parallel 
processing because of the large amounts of data 
involved and because the same operations are 
performed on every part of an image. Although 
both parallel architectures and programming 
languages have been developed for image proces- 
sing [1], they have been of the SIMD array 
variety. A large class of image processing 
algorithms, however, does not fit into an SIMD 
array format. These algorithms are highly 
parallel but asynchronous, and they have very 
different characteristics than the low level 
filtering, smoothing, and gradient operations 
that are performed on SIMD machines. This paper 
lays out the requirements for a high-level 
language for asynchronous parallel image pro- 
cessing. This work is part of a broader study 
of parallel language design being undertaken by 
the Pisces Project on Parallel and Distributed 
Processing at the University of Virginia. 


In contrast to low level image processing 
algorithms which can be expressed as parallel 
operations on every point of a two-dimensional 
array representing an image, higher level image 
processing uses a more abstract description of 
the image usually in terms of edges or regions 
(areas of approximately uniform color and tex- 
ture). These descriptions can be thought of as 
a graph where the nodes of the graph represent 
regions or edges and where the links between the 
nodes represent the connections between neigh- 
boring regions or edges in the image. There is 
a large class of asynchronous parallel algorithms 
which process such image graphs by assigning an 
identical process to each node of the graph. 

The process updates that node's description 
using information from the descriptions of 
neighboring nodes. Relaxation labeling and 
region matching for change detection are two 
examples of this class of algorithms. 


These algorithms pose particularly inter- 
esting problems for the designer of parallel 
languages since they demand a degree of process 
interaction that is intermediate between SIMD 
array languages such as ACTUS [2] or PASCALPL 
[3] and distributed processing languages such 
as CONCURRENT PASCAL [4] or Hoare's communi- 
cating sequential processes [5]. Their charac- 
teristics can be summarized as follows: 
1) Division of identical or similar processes 
to work on different parts of a large common 
data structure such as an image graph. This 
type of parallelism is a common feature of SIMD 
array languages but quite different from con- 
current processing languages. 


2) Parallelism is at the level of procedures 
rather than individual operations as in the 
SIMD array languages. 


3) Dynamic creation and destruction of pro- 
cesses and their interconnections. Unlike the 
static monitors of CONCURRENT PASCAL or MODULA, 
parallel processes must be created as new nodes 
in the image graph are created and terminated 
as nodes are removed. Unlike the fixed 4 or 8 
connected configurations of SIMD arrays, asyn- 
chronous parallel image processes must be 
allowed to communicate with an arbitrary con- 
figuration of connected nodes in the image 
graph. 


4) Closely coupled processing: since a process 
on one node must frequently access the informa- 
tion describing neighboring nodes and since the 
processes are working on parts of one large 
common data structure, processor communication 
is best supported by using shared data rather 
than passing messages. 


5) Multiple simultaneous reads of shared data. 
If all processes sharing the information in a 
node's description were forced to access it in 
a strictly sequential fashion as in CONCURRENT 
PASCAL, then much of the advantage of having 
parallel processes on each node of the image 
graph would be lost. 


6) Sequential writing of shared data. If 
several processes are permitted to modify the 
values of shared data at the same time, the 
results become dependent on the speeds of the 
particular processes involved and thus are no 
longer deterministic. A process which is modifying 
shared data must be able to lock out all 
other processes from either writing or reading 
the data. This problem is the familiar readers 
and writers problem which has many solutions, 
but it is so central to the parallelism in 
image processing that it needs to be solved by 
the language designer not by the applications 
programmer as in [6]. 
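
Requirements 5 and 6 together describe a readers-writers discipline. A minimal Python sketch of such a lock follows, as an illustration of the behavior the language would have to provide; it is not the mechanism proposed by the authors, the class name `ReadWriteLock` is hypothetical, and this simple version can starve writers under a steady stream of readers.

```python
import threading


class ReadWriteLock:
    """Many simultaneous readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:           # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:  # lock out readers and writers
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()
```

In the language outlined in this summary the equivalent of these calls would be generated implicitly, so that a process reading a neighboring node or calling a neighbor's procedure never manipulates the lock directly.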


The design of a language for asynchronous 
parallel processing must satisfy the six cri- 
teria above. We are exploring an extension of 
the distributed process concept developed by 
Hoare [5] and Brinch Hansen [6]. An image 
graph program is defined by specifying the data 
structure of a prototype node and the proce- 
dures which can process that node. As nodes 
of an image graph are created, their corres- 
ponding procedures are activated to run 
concurrently with the processes on other nodes. 
A process running on one node can read the 
data on any neighboring node but it must call 


a procedure in the neighboring node to modify 
the neighbor's data. Synchronization of reads 
and writes is transparent to the programmer. 


An entire image graph is processed by 
specifying a sequence of actions to be applied 
to all nodes in parallel in a manner reminiscent 
of the single instruction stream of an SIMD 
array machine. The actions on a node, however, 
are independent processes which resemble a dis- 
tributed processing network. 


An architecture to support the language 
must be a multiple instruction multiple data 
stream machine which can dynamically establish 
communication between different processing 
nodes. Multimicro-processors with general inter- 
connection networks such as PASM, TRAC, and CM* 
appear to be good candidates. We feel that new 
applications in image processing will be opened 
up using the language to program such asyn- 
chronous parallel processors. 
Summary 
Fastbus, the new bus standard which has 
been jointly developed by the U.S. high 
energy physics laboratories, presents serious 
software problems. The first requirement for 
building and maintaining large Fastbus data 
acquisition networks is the development of 
language facilities for describing the 
hardware and software architecture of an 
arbitrary Fastbus system. This paper 
proposes two levels of network description 
language: a hardware Network Building 
Language (H)NBL and a software language 
(S)NBL. 


The current status of the Fastbus standard
is well documented in [1]. It is a segmented
bus and combines the high local bandwidth
attainable at a segment level with global
communication via inter-segment connections
(SI's). Systemwide addressability is
supported and the hardware assumes the
responsibility of routing the data to the
correct segment and module, be it on the
local segment or on a distant segment.


Figure 1 shows a topology typical of those
employed in the collection and analysis of
data emanating from a particle physics
experiment. A bank of data collection
computers is employed to collect the massive
amount of data coming from the sensors. The
data is then passed through a filtering network
to a host computer where it is recorded and
reviewed by the physicists. Besides reducing
the data volume, the filtering computers also
perform detection of the particle-collision
events that are of prime interest in these
experiments.

In some ways Fastbus is similar to the
architecture of the Cm* system being
developed at CMU [3]. Like Cm*, Fastbus can
eliminate the need for explicit
protocol-based communication among the
computers in a local computing network (LCN).

The need for a network description
language arises from a desire to develop
software that can be easily adapted to a
range of system topologies. At the system
software level, functions that will have to be
provided include system initialization,
setting up paths between every pair of
segments, assisting in the flow of data
within the network, loading of software in
different computers, and monitoring the
status of the network. The applications
software, on the other hand, will be written
as a collection of parallel activities that
can be carried out simultaneously on
different computers in the network. In both
cases, the software can automatically adapt
itself to a particular environment if a
description of the current topology is also
available on the system in some suitable
form. To enter this information in a
structured and verifiable manner, a network
description language is needed. We assume
that a typical Fastbus system will have a
large number of segments, say about 100.


Requirements of the language:

Such a language, an NBL, could either be
procedural and algorithmic or it could be
purely declarative. We restrict ourselves to
the former variety. For non-programmers
such as hardware engineers we expect to
provide an interactive utility through which
most of these functions can be carried out;
in addition, the utility could provide a
display of the network in graphic form.

NBL should be capable of describing
hardware as well as software resources and
the current status of all the resources. The
hardware resources include network components
such as segments, segment interconnects,
processors, memories and devices. At the
hardware level, NBL should allow
specification of how segments are connected
and what modules are attached to each
segment. To assist the hardware engineers as
well as to enable software verification, NBL
should also admit the physical description of
the network in terms of crates and occupation
of slots by hardware modules. The software
facilities of NBL should minimally allow one
to specify loading of different software
modules into different parts of the network
and establish data paths among them. At the
software level NBL could be thought of as a
job control language for a distributed system.

Legend:

DCC  - Data Compressing Computer
TRIG - Trigger Computer
DC   - Data Collection Computer
FDC  - Fast Data Collection Computer
S_q  - qth segment on the Fastbus

Figure 1. Model Fastbus tree structure.


Hardware language: 
The example hardware program shows a 
description of the system pictured below: 


10 SEGMENT TOP
20 FOR I = 1 TO 4
30 SLOT I HAS TOPSI(I) : SI
40 NEXT I
50 SLOT 5 HAS HOST : PDP-11
60 SLOT 6 HAS OUTPUT : PDP-11
70 SLOT 7 HAS OUTPUTBUFS : MEM 64K
80 END SEGMENT TOP
90 FOR I = 1 TO 4
100 SEGMENT INTERMED (I)
110 SLOT 1 HAS TOPSI(I) : SI
120 FOR J = 1 TO 20
130 SLOT J+2 HAS MEDSI(I,J) : SI
140 SEGMENT LEAF (I,J)
150 SLOT 1 HAS MEDSI(I,J) : SI
160 FOR K = 1 TO 20
170 SLOT K+1 HAS DC(I,J,K) : STP
180 NEXT K
190 SLOT 22 HAS FDC(I,J) : STP
200 IF I <= 4 OR J <= 20
210 SLOT 24 HAS LEAFSI(I,J) : SI
220 ENDIF
230 IF J > 1
240 SLOT 23 HAS LEAFSI(I,J-1) : SI
250 ELSE IF I > 1
260 SLOT 23 HAS LEAFSI(I-1,20) : SI
270 ENDIF
280 END SEGMENT LEAF (I,J)
290 NEXT J
300 SLOT 23 HAS INTERMEDBUFS(I) : MEM 256K
310 SLOT 24 HAS DCC(I) : VAX
320 SLOT 25 HAS TRIG(I) : VAX
330 END SEGMENT INTERMED (I)
340 NEXT I
350 REMOVE LEAFSI (4,2)
360 REMOVE DC(2,20,9)
370 END


There are three types of segments. There is
one TOP segment with segment interconnects at
geographic addresses one through four, a host
computer in position 5, an output computer in
position 6, and a memory module in position
7. There are four INTERMED segments, each with
an SI to the TOP segment in position one, SIs to
LEAF segments in positions 3 through 22, a
memory in position 23, and two computers in
positions 24 and 25. There are 80 LEAF
segments, each with an SI to an INTERMED segment
in position one, specialized data collection
computers in positions 2 through 22, and SIs
to preceding and succeeding LEAF segments in
positions 23 and 24. Two devices are
presumably out of commission since they are
REMOVEd at the end of the specification.
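
To make the structure of the hardware description concrete, the loops of the example program can be mimicked in a general-purpose language. The following is only a minimal sketch in Python, not part of NBL or its translator: the names `build_topology` and `remove`, and the dictionary layout (segment name mapped to slot number mapped to a device name and type), are assumptions made for illustration. The LEAF-to-LEAF interconnects in slots 23 and 24 are left out to keep the sketch short.

```python
def build_topology():
    """Expand the example description into {segment: {slot: (device, type)}}.

    Hypothetical data layout; the real NBL translator builds its own data base.
    """
    segments = {}

    # TOP segment: four segment interconnects, host, output computer, memory.
    top = {i: (f"TOPSI({i})", "SI") for i in range(1, 5)}
    top[5] = ("HOST", "PDP-11")
    top[6] = ("OUTPUT", "PDP-11")
    top[7] = ("OUTPUTBUFS", "MEM 64K")
    segments["TOP"] = top

    for i in range(1, 5):
        inter = {1: (f"TOPSI({i})", "SI")}
        for j in range(1, 21):
            # SIs to the LEAF segments occupy slots 3 through 22.
            inter[j + 2] = (f"MEDSI({i},{j})", "SI")
            leaf = {1: (f"MEDSI({i},{j})", "SI")}
            for k in range(1, 21):
                leaf[k + 1] = (f"DC({i},{j},{k})", "STP")
            leaf[22] = (f"FDC({i},{j})", "STP")
            segments[f"LEAF({i},{j})"] = leaf
        inter[23] = (f"INTERMEDBUFS({i})", "MEM 256K")
        inter[24] = (f"DCC({i})", "VAX")
        inter[25] = (f"TRIG({i})", "VAX")
        segments[f"INTERMED({i})"] = inter
    return segments


def remove(segments, device_name):
    """Delete every slot entry holding device_name (mimics REMOVE)."""
    for slots in segments.values():
        for slot, (name, _kind) in list(slots.items()):
            if name == device_name:
                del slots[slot]


if __name__ == "__main__":
    segs = build_topology()
    remove(segs, "DC(2,20,9)")
    leaves = [name for name in segs if name.startswith("LEAF")]
    print(len(segs), "segments,", len(leaves), "of them LEAF segments")
```

The segment counts produced by the sketch (1 TOP, 4 INTERMED, 80 LEAF) agree with the prose description of the example system.
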
Software language: 

The software description language must be 
able to load the devices mentioned in the 
hardware description language program, 
parameterize the various instances of the 
same program by initializing global 
variables - especially with addresses of 
other hardware or software objects on the 
system, allocate and initialize some standard 
software objects - such as buffers, and make 
the initialization conditional upon the 
existence of neighbouring devices (so that 
the system can cope with not all devices 
working all the time). 

The example software program fragment
given below is to load a data collection
program into the hardware system described
above. Each DC computer is loaded with a
COLLECT program, is given a buffer for its
data, and is given the address of its
preceding and succeeding DC computers (so
the neighbours can be told to read out
sensors near "hits").

The present work is only a part of a much
larger network software project under way at
Illinois Institute of Technology with generous
technical and financial assistance from
Fermilab. The hardware version of NBL
is being implemented; the translator
generates a data base for the network which
is designed to support initial program load,
user program design and deployment, system
maintenance, fault detection, and
reconfiguration [4].


10 LET FDCBUFSIZE=600
20 LET DCBUFSIZE=600
30 LET OUTBUFSIZE=16380
40 FOR I = 1 TO 4
50 FOR J = 1 TO 20
60 LOAD FDC(I,J)
70 LO FASTCOLLECT
80 INIT PARENT = FDCBUF(J) IN INTERMEDBUFS(I)
90 INIT PARENTSIZE = FDCBUFSIZE
100 END LOAD FDC(I,J)
110 NEXT J

150 FOR K = 1 TO 20
160 LOAD DC(I,J,K)
170 LO COLLECT
180 INIT PARENT = DCBUF(J,K) IN INTERMEDBUFS(I)
190 INIT PARENTSIZE = DCBUFSIZE
200 LET K1 = K + 1
210 LET J1 = J
220 LET I1 = I
230 IF K1 > 20
240 LET K1 = 1
250 LET J1 = J1 + 1
260 END IF
270 IF J1 > 20
280 LET J1 = 1
290 LET I1 = I1 + 1
300 END IF
310 IF DC(I1,J1,K1) EXISTS
320 INIT SUCC = CSR_PREDINPUT IN DC(I1,J1,K1)
330 INIT SUCCDEFD = 1
340 ELSE
350 INIT SUCC = NULL
360 INIT SUCCDEFD = 0
370 ENDIF

. similarly for predecessors

560 END LOAD DC(I,J,K)
570 NEXT K
580 NEXT J
590 LOAD DCC(I)
600 LO DSQUEEZ
610 INIT DCBUFS = DCBUF IN INTERMEDBUFS(I)
620 INIT DCBUFSELEMSZ=DCBUFSIZE
630 INIT DCBUFSDIM1=20
640 INIT DCBUFSDIM2=20
650 INIT PARENT = BUF(I) IN OUTPUTBUFS
660 INIT PARENTSIZE=OUTBUFSIZE
670 END LOAD
680 LOAD TRIG(I)
690 LO TRIGGER
700 INIT TRIGDATA = FDCBUF IN INTERMEDBUFS(I)
710 INIT TRIGELEMSIZE=FDCBUFSIZE
720 INIT TRIGDATADIM=20
730 IF TRIG(I+1) EXISTS
740 INIT NEXTTRIG=CSR_PREDINPUT IN TRIG(I+1)
750 INIT NEXTTRIGEXISTS=1
760 ELSE
770 INIT NEXTTRIG=NULL
780 INIT NEXTTRIGEXISTS=0
790 ENDIF

. similarly for TRIG(I-1)

870 END LOAD
880 LOAD INTERMEDBUFS (I)
890 ALLOC DCBUF (20,20) : BUFFER(DCBUFSIZE)
900 ALLOC FDCBUF (20) : BUFFER(FDCBUFSIZE)
910 END LOAD
920 NEXT I
930 LOAD OUTPUTBUFS
940 ALLOC BUF(4) : BUFFER(OUTBUFSIZE)
950 END LOAD
960 END
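
The successor computation in the fragment above (the K1, J1, I1 wrap-around followed by the EXISTS test) can be restated compactly. The sketch below is an illustration in Python rather than NBL; the function names `successor` and `succ_init`, the use of index tuples in place of Fastbus addresses, and the `existing` set standing in for the EXISTS test are all assumptions made for this example.

```python
def successor(i, j, k, n_j=20, n_k=20):
    """Return the indices of the DC computer that follows DC(i, j, k).

    Mirrors the wrap-around logic of the example: K carries into J,
    and J carries into I.
    """
    k1, j1, i1 = k + 1, j, i
    if k1 > n_k:
        k1 = 1
        j1 += 1
    if j1 > n_j:
        j1 = 1
        i1 += 1
    return (i1, j1, k1)


def succ_init(i, j, k, existing):
    """Build the SUCC / SUCCDEFD initial values for DC(i, j, k).

    `existing` is a set of index tuples for devices that are present;
    a tuple stands in for the address CSR_PREDINPUT IN DC(I1,J1,K1).
    """
    nxt = successor(i, j, k)
    if nxt in existing:
        return {"SUCC": nxt, "SUCCDEFD": 1}
    return {"SUCC": None, "SUCCDEFD": 0}


if __name__ == "__main__":
    # All DC computers of the example system, minus the one that was REMOVEd.
    present = {(i, j, k)
               for i in range(1, 5)
               for j in range(1, 21)
               for k in range(1, 21)}
    present.discard((2, 20, 9))
    print(succ_init(1, 1, 20, present))   # wraps from K = 20 into the next J
    print(succ_init(2, 20, 8, present))   # successor was removed
```

Making the initialization conditional in this way is what lets the same program be loaded when some devices are out of service.
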
VSP: BUILDING BLOCKS FOR PARALLEL PROCESSORS
William S. Dowey 
Gould Inc., Chesapeake Instrument Division 
6711 Baymeadow Drive, Glen Burnie, MD 21061 


This paper addresses the Vector Scalar Processor 
(VSP) solution to the problem of proliferation of high 
cost specialized processors for computationally large 
algorithms. As the title implies, the processors can be 
configured in a variety of distributed parallel processor 
formats. The Vector Processor (VP) is the title given to 
the processor block responsible for repeated sum of 
product operations. The Scalar Processor (SP) is the 
title given to the processor block responsible for 
communications, scheduling, and data storing/retrieval. 
This VSP solution is unique in that the hardware (based 
on the AM 2900 family) in the blocks is interchangeable 
within and between VP and SP processors. This inter- 
changeability trades off design complexity for multiple 
low cost processor blocks which achieve computational 
requirements. 
Figure 1. Controller for Vector and Scalar Processors

Both the SP and the VP use a similar controller
configuration centered around an AM 2910 sequencer
(Figure 1). This sequencer has four modes of control for
microcode address selection. An external condition code
is used primarily for the data dependent operations of
the SP; the Internal Counter equal to zero is used for the
structure (algorithm) dependent VP operations. The
controller block is completed using a horizontal field
pipeline register to allow for simultaneous action of all
functional blocks in the data paths. The pipeline ROM
which contains the microcode instructions removes the
combinational logic from the design process. This logic
is replaced with programmable fields which determine
the state of the functional blocks for each clock period.
Both SPs and VPs have individual clock units, each VP
clock being slave to an SP clock. Both clocks are
capable of outputting instruction selectable periods (from
60 to 480 nanoseconds in 30 nanosecond increments). This
allows matching propagation delay paths to execution
times instead of restricting execution times to the
longest propagation paths for all instructions.
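
The benefit of instruction selectable clock periods can be illustrated with a small calculation. The sketch below is in Python and is only an illustration: the function name `select_period` and the three propagation delays are hypothetical, while the range of 60 to 480 nanoseconds in 30 nanosecond steps comes from the text.

```python
# Selectable clock periods: 60, 90, ..., 480 ns (from the text).
PERIODS_NS = list(range(60, 481, 30))


def select_period(delay_ns):
    """Return the shortest selectable period that covers a propagation delay."""
    for period in PERIODS_NS:
        if period >= delay_ns:
            return period
    raise ValueError("delay exceeds the longest selectable period")


if __name__ == "__main__":
    # Hypothetical propagation delays (ns) of three microinstructions.
    delays = [70, 100, 450]
    matched = sum(select_period(d) for d in delays)
    # A single fixed clock would have to use the longest period for all.
    fixed = len(delays) * max(select_period(d) for d in delays)
    print(len(PERIODS_NS), "selectable periods")
    print("matched clock:", matched, "ns; fixed clock:", fixed, "ns")
```

With these made-up delays the matched clock needs 660 ns against 1350 ns for a single fixed period, which is the effect the paragraph describes.
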


The Scalar Processor shown in Figure 2 features a
Von Neumann type machine modified to incorporate an AM
2901 based next address generator (NAG) for retrieval of
instructions and data from a common (macro) memory.
With the horizontal microcode, ALU operations can be
taking place on data while the next address generator is
fetching data or the next instruction word. The SP
supports both the AN-UYK-20 assembly language in-
structions and special microcode instructions (beyond the
standard instruction set) to achieve operational speed
increases for Direct Memory Access (DMA) data
handling operations. The Scalar Processor communicates
with multiple VPs through a VP data bus and receives
status from the VPs in a serial fashion. Communications
with other SPs also take place over other serial lines.

Figure 2. Functional Blocks of the Vector and Scalar Processor


The VP shown in block form in Figure 2 features a 
controller which performs repeated operations on data 
with a taper-on and taper-off of data flow required from 
the data memories. Arranged this way, the VP gains 
efficiency as the number of identical operations exceeds 
8. The Multiplier is interchangeable with a multiply 
accumulate, allowing for faster execution of sum of 
product algorithms. If the complexity or throughput 
requirements of the algorithm do not allow for in-place 
computations, SPs are used to collect partially processed
data, reorder, and pass it on to the next level of VP 
processing. 


The three VSP configurations below illustrate the 
modularity of the building blocks. The first configura- 
tion was the first pilot VSP development. It implements 
an FIR Interpolation filter of 1 million operations per 
second (MOPS) per VP. This figure is exclusive of VP 
overhead operations. This filter configuration, shown in 
Figure 3, is a computationally distributed network. It 
features two parallel paths comprised of two SP and VP 
elements. Here both halves of the parallel network 
contain duplicate microcode, so that each side is capable 
of processing either half of the incoming data. The 
second SP in each half parallel leg collects results and 
performs half aperture broadside beamforming on the VP 
result data. While it would be possible to link the VP 
parallel branches in a Single Instruction Multiple Data 
(SIMD) fashion, the necessity of half aperture operation 
overruled this approach. Similarly, the two legs of each 
VP could be combined under one controller, but the 
interchange of data at the SP was initially thought to 
preclude this. 


Figure 3. Interpolation Filter of 4 MOPS in 2 Parallel
Paths (block labels: group formatting and broadside beamforming; deskewing)


A Widrow Filter application shown in Figure 4 is in
development. It utilizes one SP and 8 parallel VPs,
each composed of 2 parallel arithmetic elements (AEs). The
VPs are under the control of 1 sequencer, in SIMD
fashion. Each AE calculates results and interchanges
these results with its paired AE, checking for computa-
tional inaccuracies. In executing the Widrow filter,
there are 5.2 MOPS/AE, which yield a total of 84 MOPS
of SIMD instruction. In this configuration, the 24 bit
coefficients are updated each cycle and rounded to 12
bits for use in the filter applications.

Figure 4. SIMD Control Achieves 84 MOPS in 8 VP
Elements


The geometric processor (GP) is an embedded processor in a
simulator for image processing calculations. The
computation rate is 9 MOPS for 6000 edges. It is
configured as shown in Figure 5, with VP2 and VP3
sharing a common controller and executing in a SIMD
fashion.

Figure 5. VP2, VP3 are SIMD Units for 80% of
Computational Time. (Input: host scene and object data;
output: to display generator.)


The Filter configurations, shown above, are in the 
process of being built with Standard Electronic Modules 
(SEM) of the U.S. Navy. The entire control sequencer block
is available in this format. All but one of the data path 
cards (the AM 2903 dual port accumulator/ALU) are 
available as standard modules. The GP is being 
developed on 5 card types, from which both SP and VP 
modules can be configured. The five distinct card types 
(Controller, ROM, RAM, ALU, Multiplier) are each based 
on the AM 2900 family components. 


The Building Blocks processor approach to dis-
tributed parallel processing is a viable concept, as the
variety of applications cited here demonstrates. Each of
the applications uses different algorithms, yet the same
approach of stacking the processors has been used to
achieve the desired computational throughput, from the
basic 6 MOPS for the VP to the 84 MOPS for the 8
parallel dual VPs in the Widrow application. The VSP
architecture allows simplicity and reliability of design, a
cardinal trait of effective building blocks.


A NEW GENERAL-PURPOSE DISTRIBUTED
MULTIPROCESSOR SYSTEM STRUCTURE 


Jin Lan 
Department of Computer Engineering and Science 
Tsinghua University 
Peking, The People's Republic of China 


Summary 


One of the basic problems of organizing
a multiprocessor is to develop a good sys-
tem structure, which, from the author's
point of view, should be modular, reconfi-
gurable, partitionable as well as tightly-
coupled. These properties are necessary for
forming a general-purpose system, in which
multitasking and multiprogramming can be
combined together to enhance the overall
system efficiency. As a possible solution
to this problem, a distributed multiproces-
sor system structure is proposed in this
paper.


It is noted that among the great variety
of parallel and multiprocessor structures,
bus [1]-[4] and array [5]-[8] are the
widely accepted approaches to solving the
interconnection problem in many operational
systems. The proposed scheme tends to com-
bine these two structures to form a new
one, taking advantage of the simplicity and
the possibility of using commercially avail-
able models of processors from the side of
the bus, and of the regularity, adaptivity to
large-scale systems and high processing power
from the side of the array. But the new struc-
ture has its own special features, not
belonging to its "predecessors". Its main
difference from the bus structure is the
concept of the "split bus", realized by in-
troducing switches for routing control and
elimination of bus contention. The basic
structure resembles a mesh connection, but
it is more flexible and has shorter mes-
sage paths than the simple array. These
considerations lead to the structure shown
in Fig.1 for n*n processors. The circles
denote the three-pole bidirectional
switches, and the solid dots denote the
processor-nodes. This structure can be redrawn
symbolically as in Fig.2, where lines, cross-
points and shaded triangles represent bus-
segments, processor-nodes and switches
respectively.
Each switch has three poles
connected to three processors at its
corners. This notation helps in revealing
some advantages of the structure owing to
the existence of connecting paths along
the diagonal directions of the array. This
fact causes a significant reduction of
message transfer distance between proces-
sors, the maximum length of which for an n*n
array equals

$$d_{\max} = \frac{2n - (2n \bmod 3)}{3}$$


Fig.1 Two-dimensional system structure

Fig.2 Graphical representation
of a planar array structure


It means that n = 2k + 1 for k = 1, 2, ...
will be the optimum array size. In compa-
rison with other structures with message
distance O(log N), this structure may offer
some benefit for moderate systems with a
total number of processors not exceeding
N = 100, because in this case $d_{\max} = 6$,
while $2^6 = 64 < 100$.
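
The comparison with logarithmic-distance structures can be checked numerically. The sketch below is in Python and rests on an assumption: that the maximum message distance for an n*n array is (2n - (2n mod 3)) / 3, a reading of the formula that reproduces the value of 6 quoted for N = 100. The function name `d_max` is chosen here for illustration.

```python
def d_max(n):
    """Maximum message distance in an n*n array (assumed reading of the formula)."""
    return (2 * n - (2 * n) % 3) // 3


if __name__ == "__main__":
    for n in range(3, 11):
        total = n * n
        # Number of nodes reachable within the same distance in a structure
        # whose message distance grows as log2(N).
        log_reach = 2 ** d_max(n)
        print(f"n={n:2d}  N={total:3d}  d_max={d_max(n)}  2^d_max={log_reach}")
```

For n = 10 the sketch gives a distance of 6 while 2^6 = 64 is below N = 100, which is the case cited in the text.
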


Another main characteristic of the pro-
posed structure is the ability of dynamic
reconfiguration and programmable parti-
tioning of the system. Two examples are
shown in Fig.3, in which the 4*4 array is
transformed into a linear array or
ring, or into two separate trees with 7 and 9
processors respectively. Another example,
the transformation of a 6*6 array into a 5-level
binary tree, is shown in Fig.4.


Fig.3 Examples of reconfigured systems
(double tree; linear array or ring)

Fig.4 Reconfiguration into a five-level
binary tree


A further improvement of the structure
can be achieved by using four-pole bidirec-
tional switches with the symbol shown in
Fig.5. The number of different states of
the switch increases to 15, giving great
flexibility of interconnection. The four
poles can be imagined to form three
planes perpendicular to each other. This
provides convenience in organizing a three-
dimensional cube structure. Every one of
the n cross-sections parallel to any sur-
face of this cube contains n*n processors,
forming just an array like that shown in
Fig.1. This makes the structure useful for
organizing an MSIMD system with total num-
ber of processors $N = n^3$, so that the
message distance between any two nodes is
reduced to $O(\sqrt[3]{N})$, and the optimum size of
the system is extended to N = 1000.
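
The figure of 15 states can be reproduced by one simple counting argument, offered here as an assumption rather than as the author's stated derivation: if a state of the switch is taken to be a grouping of its poles into mutually connected sets, the number of states equals the number of ways to partition the set of poles. The Python sketch below enumerates these groupings; the pole labels and the function name `partitions` are hypothetical.

```python
def partitions(items):
    """Yield every way of splitting a list of poles into connected groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # The first pole stays isolated...
        yield [[first]] + part
        # ...or joins one of the groups already formed.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


if __name__ == "__main__":
    three_pole = list(partitions(["a", "b", "c"]))
    four_pole = list(partitions(["a", "b", "c", "d"]))
    print(len(three_pole), "groupings for a three-pole switch")
    print(len(four_pole), "groupings for a four-pole switch")
```

Under this reading a three-pole switch has 5 states and a four-pole switch has 15, in agreement with the number given in the text.
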


Still another way of using the fourth
pole of the switch may be to add more mes-
sage paths to the original array of Fig.1.
One of the modified arrays thus obtained
is shown in Fig.6. For clarity, only a
small part of the additional paths is
represented: the 6 paths from one switch
(by heavy lines), and the paths connecting
one processor to its 12 neighbours.

Fig.5 Symbol of a 4-pole
bidirectional switch
A MULTI-MICROCOMPUTER ARCHITECTURE
FOR AN ITERATIVE ALGORITHM 


Dan I. Moldovan 
Department of Electrical Engineering 
Colorado State University 
Fort Collins, CO 80523 


Summary 


This paper analyzes the inherent parallelism 
in computation of some recursive algorithms. The 
class of functions considered present at least 
two levels of parallelism. A multi-computer arch- 
itecture is proposed for this class of problems. 
This architecture can be easily implemented with 
microcomputers, and a high degree of modularity 
may be achieved. The operation of such architec- 
ture, including the computer communication and 
timing was studied on a simulated model. 


Consider the following recursive vector
function

$$x(k+1) = f[x(k), x(k-1), \ldots, x(k-m+1)] \qquad (1)$$

with $x(k) \in R^n$ and $k \in \{0, 1, \ldots, K\}$. The vector
x(k+1) depends on its previous m values. The
iterative nature of the above expression derives
from the fact that the computational procedure
repeats when k is incremented. Equation (1) can
represent a system of difference equations de-
scribing the behavior of some dynamic systems
commonly seen in control theory and signal pro-
cessing. Sometimes, logic equations used in the
design of digital systems are put in the form of
expression (1).


The first level of parallelism in computing
(1) is achieved when all components of vector
x(k+1) are computed simultaneously. Next, let
us assume that each component can be written as

$$x_i(k+1) = f_i(\phi_{i1}, \phi_{i2}, \ldots) \qquad (2)$$

where $\phi_{ij} = \phi_{ij}[x(k), x(k-1), \ldots, x(k-m+1)]$. Notice
that all $\phi_{ij}$ can be computed independently and
simultaneously if they are assigned to different
computers, provided that vector x is available.
This represents a second level of parallelism
which can be exploited. While further levels
might be possible, we consider only the first two
levels, and this is sufficient for many applica-
tions.
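
The two levels of parallelism can be sketched in software. The following Python fragment is only an illustration under stated assumptions: worker threads stand in for the separate microcomputers, the function name `iterate` is chosen here, and the particular inner functions and numbers in the usage part are made up and are not the example treated later in the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def iterate(history, phis, combine):
    """Compute x(k+1) from history = [x(k), x(k-1), ...].

    phis[i][j] is a callable giving the inner function phi_ij(history);
    combine[i] is a callable giving f_i from the list of its phi_ij values.
    """
    with ThreadPoolExecutor() as pool:
        # Second level: every phi_ij is evaluated independently.
        futures = [[pool.submit(phi, history) for phi in row] for row in phis]
        partial = [[fut.result() for fut in row] for row in futures]
        # First level: every component x_i(k+1) is evaluated independently.
        pairs = zip(combine, partial)
        return list(pool.map(lambda pair: pair[0](pair[1]), pairs))


if __name__ == "__main__":
    # Hypothetical two-component problem with m = 2.
    phis = [
        [lambda h: 0.5 * h[0][0], lambda h: 0.1 * h[0][1] * h[1][1]],
        [lambda h: 0.2 * h[0][0], lambda h: 0.3 * h[0][1] * h[1][1]],
    ]
    combine = [sum, sum]
    history = [[1.0, 2.0], [3.0, 4.0]]   # x(k), x(k-1)
    print(iterate(history, phis, combine))
```

The only synchronization point is between the two levels: no $f_i$ can start before all of its $\phi_{ij}$ values have been delivered.
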


Parallel processing is often motivated by
the desire to increase the speed of computation.
Thus, especially under real-time conditions, paral-
lel processing might be the only solution to com-
plex numerical problems. Some recent micropro-
cessors, with their relatively low cost and high
computing power, open new possibilities to imple-
ment powerful multiprocessor systems.


One possible multi-computer architecture for
computing x(k+1) is shown in Figure 1.

Figure 1 (labelled block: Bus Controller)

Each function $\phi_{ij}$ is assigned to one microcompu-
ter $\mu c_{ij}$. The functions $f_i$ are computed on
microcomputers $\mu c_i$. The vertical buses $B_i$ are
used to transfer data between $\mu c_i$ and $\mu c_{ij}$, in
both directions. The horizontal bus B is used
to transfer the newly computed vector components
$x_i(k+1)$ from their source to other computers
$\mu c_j$. Each microcomputer consists of a micropro-
cessor, memory, I/O and control logic. Because
of the iterative nature of the problem under
consideration, the memory of each computer is
relatively small.


Several modes of operation of this computing struc-
ture are possible. We choose to operate each
computer independently of the others and to run
different clocks. However, a computer does
not start its processing tasks until the required
data has arrived. This is considered to be syn-
chronized operation because all necessary compu-
tations are performed within one iteration. The
synchronized operation is preferred here over
asynchronous operation because we want to main-
tain a high degree of accuracy in computations.


The transmission of data from source to 
destination is conditioned by the availability 
of the respective bus and the readiness of the 
destination. In the model used, a computer 
cannot be interrupted to receive data while 
processing. 


For our convenience, we partition the activ-
ities involved in one iteration into processes.
The following types of processes take place in
each iteration.

P1. Transfer components of vector x(k) from $\mu c_i$
to $\mu c_{ij}$, as needed.

P2. Compute $\phi_{ij}$ on $\mu c_{ij}$.

P3. Transfer the result $\phi_{ij}$ from $\mu c_{ij}$ to $\mu c_i$.

P4. Compute $x_i(k+1) = f_i$ on $\mu c_i$.

P5. Transfer $x_i(k+1)$ from source to other micro-
computers on the B bus, as needed.

P6. Next iteration, $k \leftarrow k+1$, update variables.

Our goal is to perform simultaneously as many pro-
cesses as possible.


The study of such an architecture was done on a
simulated model. The first step is to map the
mathematical problem of form (1) into an archi-
tecture of the type shown in Figure 1. Since no
formal procedure for this mapping has been established,
and the mapping is hence not unique, the aim of the simula-
tion is to estimate the system performance and
to identify ways of improving it. We are not
much interested in simulating the execution
of a program on microcomputers; instead, we simu-
late the data flow between computers and the op-
erational strategy. A simulated real-time clock
marks the timing events. For each time unit the
program scans all the microcomputers, determines
their status and initiates or terminates activi-
ties. Various processing times and data trans-
fers between computers, dictated by the mathemat-
ical problem, are stored in a set of matrices.


The output of the simulation program pro-
vides the number of time units required for one
iteration, the structure's speedup factor and its
efficiency, the utility factors for all microcom-
puters and the number of bus contentions. An
analysis of such output data allows us to "tune"
the architecture according to the mathematical
problem to achieve the desired performance.


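Example: Consider the following problem.

$$\begin{bmatrix} x_1(k+1) \\ x_2(k+1) \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1(k) \\ x_2(k) \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} x_1(k)\, x_1(k-1) \\ x_2(k)\, x_2(k-1) \end{bmatrix} \qquad (3)$$

First, partition the problem such that only 6
computers are used. One possible way is:

$$\phi_{11} = a_{11} x_1(k) + b_{11} x_1(k)\, x_1(k-1)$$
$$\phi_{12} = a_{12} x_2(k) + b_{12} x_2(k)\, x_2(k-1)$$
$$\phi_{21} = a_{21} x_1(k) + b_{21} x_1(k)\, x_1(k-1)$$
$$\phi_{22} = a_{22} x_2(k) + b_{22} x_2(k)\, x_2(k-1)$$
$$f_1 = \phi_{11} + \phi_{12} \quad \text{and} \quad f_2 = \phi_{21} + \phi_{22}$$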


Based on this partitioning of the mathematical
problem and knowing the characteristics of the
microcomputers used, one can estimate the execu-
tion time for each function. It is feasible to
consider fluctuations in the processing speed of
these functions for different iterations. A uni-
form distribution around a selected mean was
assumed in our simulation. For Intel 8086 based
microcomputers, it is estimated that $t_{ij} = 210$
time units, $t_i = 10$ time units, and transmission
between two adjacent computers is 5 time units.
One time unit corresponds to approximately 5 clock
cycles.


For this input data the simulation program
provided the outputs indicated in the first col-
umn of Table 1. It can be seen that $\mu c_1$ and $\mu c_2$
are underutilized when compared with the rest.
Next, we want to further decrease the computation
time for problem (3) by introducing more
computers.


Speed up factor 1420 
Efficiency 0.72
gee utility 0.89 | 
C15 utility 0.93 
Mey utility 0.97 
Mey, utility 0.82 
cos utility 0.87 
UC 55 utility 0.93 
#03 utility 0.86 
[HC 5, utility 0.90 
uc, utility 0.49 
uC, utility 0.46 
125 Cees 


Iteration time
This is done by assigning 4 computers to compute
$f_1$ or $f_2$ instead of only two as used previously.
The results are summarized in the second column
of Table 1. Notice that the computation time per
iteration is reduced by almost half and the utili-
ty factors for $\mu c_1$ and $\mu c_2$ improved while the
others still remained high.


PARALLEL NONLINEAR MINIMIZATION BY CONJUGATE DIRECTIONS 


Efthymios C. Housos and Omar Wing 
Department of Electrical Engineering 
Columbia University 


New York, NY 


Summary


In the development of parallel algorithms for 
minimization new requirements, such as minimi- 
zation of the communication time and the exchange 
of data among processors, become as important as 
the classical requirements of good convergence and 
numerical robustness. In this paper algorithms 
suitable for the solution of the unconstrained 
minimization problem on a parallel computer are 
presented. The algorithms involve the parallel 
execution of linear searches along conjugate 
directions. The basic assumptions about the 
parallel computer are the following: 


1) Every processor is able to exchange informa- 
tion with every other processor. 


2) The communication time is important and 
should be minimized. 


The algorithms developed are of the conjugate
direction type but are different from those re-
ported in [1].

The importance of conjugate directions for
the solution of the unconstrained minimization
problem has been realized by many researchers [5,6].
The conjugate direction algorithms are based on
the conjugacy properties of a set of vectors with
respect to a certain matrix. Namely, a set $D =
\{d_i, i = 1, \ldots, n\}$ consists of conjugate vectors with
respect to a matrix H, if and only if

$$(d_i, H d_j) \begin{cases} = 0 & \text{for } i \neq j \\ \neq 0 & \text{for } i = j \end{cases}$$


Assuming that there is some way of finding a set
of conjugate directions given a matrix H and
using Theorem 1 below, the solution of the problem

$$\min J(x), \qquad x \in R^n \qquad (1)$$

where J(x) is quadratic, could be found in one
parallel step involving a linear search along each
of the directions.


Theorem 1 


The minimum of a quadratic function, $J(x) = x^T H x +
b^T x + c$, can be found by searching through a set
$D = \{d_i, i = 1, \ldots, n\}$ of conjugate directions with
respect to H once and only once in any order.
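
A small numerical check may make the theorem concrete. The sketch below, in Python, assumes a symmetric 2 by 2 matrix H, a made-up vector b, and a hand-picked pair of H-conjugate directions; the function names are chosen for illustration. Each line search starts from the same point and is independent of the others, so each could run on its own processor, and the minimum is obtained by adding the resulting steps.

```python
def matvec(H, v):
    """Matrix-vector product for H given as a list of rows."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def line_search_step(H, b, x0, d):
    """Exact minimizing step along d for J(x) = x^T H x + b^T x + c.

    Assumes H is symmetric, so the gradient at x0 is 2 H x0 + b.
    """
    grad = [2 * g + bi for g, bi in zip(matvec(H, x0), b)]
    return -dot(d, grad) / (2 * dot(d, matvec(H, d)))


def minimize_quadratic(H, b, x0, directions):
    """One search per conjugate direction, all started from x0."""
    steps = [line_search_step(H, b, x0, d) for d in directions]  # independent
    x = list(x0)
    for lam, d in zip(steps, directions):
        x = [xi + lam * di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    H = [[2.0, 1.0], [1.0, 2.0]]
    b = [-8.0, -10.0]
    directions = [[1.0, 0.0], [1.0, -2.0]]   # (d1, H d2) = 0
    print(minimize_quadratic(H, b, [0.0, 0.0], directions))
```

For this data the gradient 2Hx + b vanishes at x = (1, 2), and the sum of the two independent steps lands on that point.
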


This theorem implies that if a set of n con- 
jugate directions with respect to H and a set of 
n processors were available then the solution of 
(1) could be found in one major parallel step and 
an additional step that involves the addition of 
the local minima. Of course, this is only true if 
J(x) is quadratic. If J(x) is not quadratic the
above theorem would suggest finding a set of con- 
jugate vectors with respect to the Hessian of 
J(x). Thus the major problem becomes one of find- 
ing a set of conjugate vectors in parallel. That 
is, develop methods of producing conjugate vectors, 
with respect to a matrix, which are amenable to 
parallel computation. The difference between the 
serial and the parallel algorithms comes from the 
fact that for parallel computation it is necessary 
to estimate all the conjugate directions before 
any linear searches are performed. Once this is 
achieved, all the linear searches may be performed 
in parallel and, hence, a large part of the total 
computation time is thus parallelized. This is 
because a linear search usually involves 3-5 
function evaluations which can be time consuming 
for a fairly complex objective function. The 
parallel execution of linear search procedures 
also insures that the utilization of the parallel 
computer will be high because the communication 
time (time for the exchange of information among 
processors) will be a small fraction of the total 
computation time. It has been shown that for the 
class of parallel computers such as the SIEMENS 
SMS 201 the ratio of the communication time to the 
actual computation time is the most critical
factor in achieving a reasonable "speed-up" [2].


The Gram-Schmidt method could be used for 
finding a set of conjugate vectors with respect 
to a matrix but this method is both computational- 
ly inefficient and not readily parallelizable. 
For these reasons it would be desirable to have 
methods of finding conjugate or semi-conjugate 
directions which are amenable to parallel compu- 
tation. An algorithm for the solution of (1) 
based on two theorems proved by M.J.D. Powell [3] 
will be presented next. Details about the algori- 
thm and computational experience in solving power 
system problems using this algorithm can be found 
in [4]. 


Algorithm 


Choose a set of orthogonal vectors as the
initial search vectors. Let these vectors be
$d_i$, i=1,...,n, where n is the dimension of the
problem. Choose an initial point $x^0$ and calculate
$J(x^0)$.

Step 1. Find the n one-dimensional minima along
the n search vectors, that is,

\[ J(x^0 + \lambda_i d_i) = \min_{\lambda} J(x^0 + \lambda d_i), \quad i = 1,\ldots,n \]

where $\lambda_i$ is the optimum steplength. Let

\[ x_i = x^0 + \lambda_i d_i, \quad i = 1,\ldots,n \tag{2} \]

STOP if a solution has been found. This step can
be implemented in parallel using up to n processors.


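Step 2. Set

\[ \alpha^* = 1 \tag{3} \]

or calculate $\alpha^*$ such that

\[ J\Big(x^0 + \alpha^* \sum_{i=1}^{n} \lambda_i d_i\Big) = \min_{\alpha} J\Big(x^0 + \alpha \sum_{i=1}^{n} \lambda_i d_i\Big) \]

and set

\[ x_{n+1} = x^0 + \alpha^* \sum_{i=1}^{n} \lambda_i d_i . \]
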
Step 3. Set $x^0 = x_s$ such that

\[ J(x_s) = \min_{k} J(x_k), \quad k = 1,\ldots,n+1 \tag{4} \]

Usually $J(x_s) = J(x_{n+1})$.
Step 4. Normalize the current search directions
with respect to the Hessian matrix of J(x). That
is, estimate $(d_i, H d_i)$, i=1,...,n, and set

\[ d_i \leftarrow \frac{d_i}{\sqrt{(d_i, H d_i)}}, \quad i = 1,\ldots,n \tag{5} \]

where H is the Hessian matrix of J.


Step 5. Update the search directions using an
orthogonal matrix P as follows:

\[ d_i' = \sum_{k=1}^{n} p_{ik} d_k, \quad i = 1,\ldots,n \tag{6} \]

That is, every new search direction is a linear
combination of all previous search directions.

Step 6. Set

\[ d_i = \sigma_i d_i', \quad i = 1,\ldots,n \]

where $\sigma_i$ is (with equal probability)
either 1 or -1.

GO TO Step 1.


As can be seen, the algorithm involves primarily
the parallel execution of linear searches along
semi-conjugate directions. An orthogonal matrix
P is used in updating the current set of search
directions to another set of directions, which is
closer to being conjugate with respect to the
Hessian matrix than the original set of search
directions. More details about the significance
of the orthogonal matrix P, and test case results
for orthogonal matrices, can be found
in [4].




A PARALLEL ALGORITHM FOR SOLVING
BAND SYSTEMS OF LINEAR EQUATIONS


Ladislav HALADA 


Institute of Technical Cybernetics 


Slovak Academy of Sciences 
809 31 Bratislava, 


Summary


In this paper a new parallel direct
algorithm for solving band systems of
linear equations is discussed. The algorithm
is similar to the "shooting method"
proposed by Bank and Rose [1] and to the LU
decomposition mentioned by Sameh and
Kuck [2]. However, the formula from which
the algorithm is derived is believed to
be new.


Let us consider linear systems of
equations Ax=b, where A is a regular
band matrix of order n with bandwidth
(2m+1), i.e. $a_{ij} = 0$ for $|i-j| > m$ and
$a_{i,i+m} \neq 0$ for i = 1,2,...,n-m. We assume
a frequent situation in practice, $m \ll n$.
The algorithm is based on the following
assertion [3]: If A is a nonsingular
matrix, the first m components of the
solution vector x satisfy


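\[
\begin{pmatrix}
z^{(1)}_{n-m+1} & z^{(2)}_{n-m+1} & \cdots & z^{(m)}_{n-m+1} \\
z^{(1)}_{n-m+2} & z^{(2)}_{n-m+2} & \cdots & z^{(m)}_{n-m+2} \\
\vdots & \vdots & & \vdots \\
z^{(1)}_{n} & z^{(2)}_{n} & \cdots & z^{(m)}_{n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
=
\begin{pmatrix} z^{(0)}_{n-m+1} \\ z^{(0)}_{n-m+2} \\ \vdots \\ z^{(0)}_{n} \end{pmatrix}
\tag{1}
\]

where $z^{(i)}_j$ is the j-th unknown of the system

\[ T z^{(i)} = c_i, \quad i = 0, 1, \ldots, m. \tag{2} \]
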


Here, $c_0 = b$ and $c_i$, i=1,2,...,m, is the
i-th column of A. The matrix T is a modified
matrix A. It originates from A by
omitting columns $c_1, c_2, \ldots, c_m$ and by
adjoining columns $e_{n-m+1}, \ldots, e_n$ after
the last column of A, where $e_j$ is the
j-th column of the identity I. Thus, T
is a lower triangular matrix of order n
with bandwidth (2m+1).


If $x_1, x_2, \ldots, x_m$ are known, solving
the system

\[ T y = d \tag{3} \]

where the components of the column vector d
are given by

\[ d_i = b_i - \sum_{j=1}^{m} a_{ij} x_j, \quad i = 1,2,\ldots,n, \]

we can obtain the other components of x, because
$y_i = x_{i+m}$, i = 1,2,...,n-m, holds.
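
The recovery of the remaining unknowns from the first m components can be sketched serially as a forward recurrence over the rows of A. This is only an illustrative stand-in for the parallel triangular solve described in the paper: the function name, the dense list-of-rows storage and the test matrix are assumptions, and the sketch relies on the stated condition that every element of the uppermost superdiagonal is non-zero.

```python
def complete_solution(A, b, m, x_head):
    """Given the first m components of x, recover x[m:], one row at a time.

    A is an n x n band matrix stored densely, with A[i][j] == 0 for
    |i - j| > m and A[i][i + m] != 0 (0-based indices).
    """
    n = len(A)
    x = list(x_head) + [0.0] * (n - m)
    for i in range(n - m):
        lo = max(0, i - m)
        # Row i only involves x[lo .. i+m]; all but x[i+m] are already known.
        known = sum(A[i][j] * x[j] for j in range(lo, i + m))
        x[i + m] = (b[i] - known) / A[i][i + m]
    return x


# Hypothetical tridiagonal example (m = 1) with exact solution [1, 2, 3, 4].
A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
b = [4.0, 8.0, 12.0, 11.0]
x = complete_solution(A, b, 1, [1.0])
```

Repeated division by the superdiagonal elements is the kind of recurrence in which the growth of intermediate values can lead to overflow or underflow, so the sketch is meant for small, well-scaled examples only.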


Thus, the algorithm consists of the
following stages:

Stage 1. The solution of the systems (2).
Let us use for solving (2) Algorithm II
of [4]. A simultaneous computation of
(m+1) triangular systems differing from
each other only in the right-hand side
by this algorithm requires

\[ T^{(1)} = (2 + \log 2m)\log n - \tfrac{1}{2}(\log^2 2m + \log 2m) + 3 \]

time steps using no more than
$3m^2 n + mn - 8m^3$ processors @.


Stage 2. The computation of the system (1).
Solving this dense system of order m using
Gaussian elimination with pivoting requires
$T^{(2)}$ = am(log m - 1) + O(log m)
steps using $(m-1)^2$ processors.


Stage 3. The computation of the system (3).
Solving (3) by Algorithm II with the computation
of $d_i$ requires $T^{(3)} = T^{(1)} + \log m + 2$
steps using no more than


(1/2) m* 


n+ (1/2) annem processors. 


Unfortunately, the algorithm fails
for the same reason as banded triangular
solvers: it suffers from the possibility
of over- or underflow. On the other hand,
it does not fail if all of the leading
principal submatrices are singular, and it
can be easily modified when the number of
available processors is much less than the
order of the system, e.g. by using a practical
band triangular system solver discussed
in [4]. However, we have proved
the following theorem.


Theorem. Let A be a regular band matrix
of order n with bandwidth (2m+1),
where $m \ll n$. Then Ax=b can be solved on
an SIMD type parallel machine in
$(4 + 2\log 2m)\log n + O(m \log m)$ time steps
using no more than $3m^2 n + O(mn)$ processors.


We remind the reader that the algorithm
can effectively accept matrices
with a different number of non-zero super-
and subdiagonal lines, or matrices of the
semi-band form, but the elements of the
uppermost line above the diagonal have to be
non-zero.
In addition, if A is a matrix of
Hessenberg form or, more generally, of regular
m-band triangular form ($a_{ij} = 0$ for
$j-i > m$ and $a_{i,i+m} \neq 0$ for i=1,2,...,n-m),
the equations (1)-(3) are valid, too, but
T is now a dense lower triangular matrix.
In such a case, if one applies Algorithm I
of [4] for the computation of (2) and (3),
the total time for solving Ax=b will be
$\log^2 n + 3\log n + O(m \log m)$ time steps using
no more than $(15/1024)n^3 + O(mn^2)$ processors.

@ Throughout this paper $\log p = \lceil \log_2 p \rceil$,
and time is measured in steps.


LSI IMPLEMENTATION OF MODULAR INTERCONNECTION NETWORKS
FOR MIMD MACHINES


L. Ciminiera and A. Serra 


Centro Elaborazione Numerale dei Segnali 


c/o Istituto di Elettrotecnica Generale 
Politecnico di Torino 
C.so Duca degli Abruzzi, -n. 24 — 10129 TORINO —- Italy 


Summary


This paper presents the LSI implementation
of a class of permutation networks including
omega, indirect binary n-cube and flip networks.
Since the set of interconnection networks
considered belongs to the class of delta
networks, defined in [1], the control functions
can be easily distributed among several devices.
A new control scheme, suitable for MIMD machines,
which allows fully asynchronous operations,
is also introduced.

The parallel implementation of this
class of interconnection networks, without
recirculating or pipelining, is discussed
in this paper. The minimization of transmission
delay and implementation cost is considered,
taking into account the constraints imposed by the
current integrated circuit (IC) technology.


The basic block, replicated for constructing
the whole network, is a one nxn omega network,
with $n = 2^p$, instead of the 2x2 crossbar switch.
In such a way the total number of modules is
reduced by the factor $(n/2)\lg_2 n$ and the complexity
of the resulting chip is moderate. Since Wu and
Feng in [2] state the topological equivalence
between a baseline network and the simplified
manipulator, flip, omega, reverse baseline and
indirect binary n-cube networks, using the nxn
omega network as a basic component, it is possible
to obtain each of the previously mentioned
networks. In the following it will be considered
that the number of inputs and outputs of the
whole network to be implemented is equal to $N = 2^m$
with $m \geq p$. Another parameter which should be taken
into account is the number, B, of signals that
are exchanged between each transmitter-receiver
pair; it will be assumed $B_1$ unidirectional signals
and $B_2$ bidirectional signals, with $B_1 + B_2 = B$.


of signals that 


Each integrated circuit, performing the
switching function of an nxn omega network, allows
the parallel transmission of $w_1$ unidirectional
signals or $w_2$ bidirectional signals between each
input-output pair. Obviously, the larger the
values of n, $w_1$ and $w_2$ are, the smaller is the
number of chips required to implement a given
network of the class considered. In effect, n, $w_1$
and $w_2$ affect both the complexity of the circuit
integrated in a single chip, and the number of
connections (pins) required; hence the values of n, $w_1$
and $w_2$ should be determined so that the
constraints imposed by the current integration and
packaging technologies are satisfied. The number
of connections required by the implementation of
a unidirectional nxn omega network, $P(n, w_1)$, is
given by the following formula:

\[ P(n, w_1) = 2 w_1 n + n \lg_2 n + 2 \tag{1} \]

assuming a different control signal for every 2x2
crossbar switch, so that the maximum number of
allowed permutations may be achieved ($n^{n/2}$).


The complexity of the implementation of
one nxn omega network depends on the number of
gate levels. Using $2 \lg n$ levels, the complexity
is $O(n \lg n)$. Implementing the same switching
function using only two gate levels, the circuit
obtained is faster than the first implementation
and the number of gates required is given by the
following formula:

\[ G(n, w_1) = w_1 n (n+1) \tag{2} \]


For bidirectional nxn omega networks it is
possible to derive analogous formulas.

From equations (1) and (2) it is possible
to deduce that the value of the ratio (number of
gates)/(number of pins) is smaller than the
current values obtained with the LSI technology;
thus, increasing the values of n, $w_1$ and $w_2$, the pins
available are saturated when chip area is still
available.


One of the main design goals in MIMD
interconnection networks is to distribute the routing
functions among several units, each of them
controlling a subset of the whole network, so
eliminating the centralized control, which introduces
performance and reliability bottlenecks.

Since the gates/pins ratio of the previously
proposed IC is very small, one might guess that
it is feasible to put in the same chip both the
connecting subnetwork and its control unit. The
latter needs a set of input and output signals,
therefore many other pins are required. A more
attractive solution is depicted in Fig. 2: the
control of a subnetwork, built with the ICs
proposed here, is concentrated in a dedicated chip. It
broadcasts the command signals to the unidirectional
(A,B) and bidirectional (C) switching elements
of the corresponding subnetwork. The mechanism
of searching and allocating the path requested
through the network is described below. The
request generated by a processor is issued at the
input to the control unit of the subnetwork in
the first stage, connected with that processor;
each request is issued with the binary output
device address. The control unit in the first stage
receives the request signal and $\lg_2 n$ bits of the
output device address. This set of $\lg_2 n$ address
bits is chosen on the basis of the type of network
implemented. In an omega network, for instance,
the most significant $\lg n$ bits are connected with
the control unit of the first stage subnetwork,
the next $\lg n$ most significant bits are connected
with the second stage control unit and so on. On
the basis of the state of the switching elements,
the active requests and the addresses related to
them, the control unit decides whether or not to
accept the request. If the request is accepted
at the first stage, a request for the second
stage is generated and the status of the switching
elements is changed to accommodate the new
connection. When the second stage receives the request
issued by the first stage, an analogous mechanism
starts. Thus, the path requested is searched for
and allocated, stage by stage, until the target
outlet is reached. If, in any stage, the control
unit detects a conflict between the requested
path and the connections active at that time, the
status of switching elements is not changed and
a busy signal is issued back to the processor
through the previously allocated connections. When
the busy signal is received by the requesting
processor, the associated request is turned off and
re-issued later. The connections are kept until
the processor which issued the request terminates
the transfer of information; at that time,
it clears the request and releases, stage
by stage, all the trunks which compose the
whole connection. A control unit for one
nxn omega network, performing the functions
specified above, could be implemented using an
asynchronous sequential circuit, which may be
integrated in a single IC.

Using formulas (1), (2) and
analogous formulas for bidirectional nxn omega
networks, it is possible to find the values of
n, $w_1$ and $w_2$ leading to the minimum number of
chips required for implementing a given network
of the class considered in this paper. The
results of this calculation, for different values
of the pins per package available, $P_0$, are shown
in Table I, where the values of n, $w_1$ and $w_2$ are
calculated for a network having N=16, $B_1$=26,
$B_2$=16. In this table, the values of the number
of chips, C, required for implementing the above
specified network, are also shown. From Table
I it can be seen that, using the implementation
proposed, few chips are required to build an
interconnection network.

References

[1] Patel J.H., "Processor-memory interconnections
for multiprocessors", Proc. 6th Ann. Symp. on
Computer Architecture, April 1979, pp. 168-177.

[2] Wu C. and Feng T., "Routing techniques for a
class of multistage interconnection networks",
Proc. 1978 Intern. Conf. on Parallel Processing,
August 1978, pp. 197-205.
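
The distribution of address bits over the stage control units in the omega network case can be sketched as follows. This is a minimal illustration, not the authors' circuit: it assumes the whole network has 2^m outputs, each subnetwork has 2^p outputs with p dividing m, and the function and parameter names are hypothetical.

```python
def stage_address_fields(dest, m, p):
    """Split an m-bit output address into p-bit fields, one per stage.

    Field 0 holds the most significant p bits (first-stage control unit),
    field 1 the next p bits (second-stage control unit), and so on.
    """
    if p <= 0 or m % p != 0:
        raise ValueError("this sketch assumes that p divides m")
    if not 0 <= dest < (1 << m):
        raise ValueError("destination address out of range")
    mask = (1 << p) - 1
    stages = m // p
    return [(dest >> (m - p * (s + 1))) & mask for s in range(stages)]


# 16x16 network built from 4x4 subnetworks: m = 4, p = 2, two stages.
fields = stage_address_fields(0b1101, 4, 2)
```

For output 13 (binary 1101) the first-stage unit would see the field 3 (binary 11) and the second-stage unit the field 1 (binary 01).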


Fig. 1. 16x16 indirect binary n-cube using eight 4x4 omega networks.
Fig. 2. Interconnection between central and switching units.


ANOTHER APPROACH TO MAKING SUPERCOMPUTER BY
MICROPROCESSORS--CELLULAR VECTOR COMPUTER
OF VERTICAL AND HORIZONTAL PROCESSING
WITH VIRTUAL COMMON MEMORY


Gao, Qing-Shi 


Zhang Xiang 


(Institute of Computing Technology, Academia Sinica) 


Summary 


In this paper*, starting from the
"Pipeline Vector Computer of Vertical and
Horizontal Processing" (mxnp type) [1],
which is based on small and medium scale
integrated circuits, we briefly describe
the "Pipeline CVCVHP with Common Memory"
(mxn type, MxNp type) [2], which is
introduced because of the development of
large-scale integrated circuits. This is
a new type of vector computer employing a
multiple data stream and multiple instruction
stream architecture.

Afterwards, we emphasize a new type
of supercomputer, i.e. CVCVHP with "Virtual
Common Memory" rather than with "Common
Memory". This system may consist of thousands
of cells (or microprocessors) [3].

This system has the following features:

1. The main part of the system can be
implemented by microprocessors; each one is
called a cell. (It is desirable that the design
of the microprocessors suits the
system configuration well.) There is an arithmetic
unit, an instruction unit, a main
memory (S2) and a bipolar memory in every
cell. The bipolar memory is used for look-ahead
and post buffers (L), operating
registers (R), local instruction memory
(S3) and high-speed memory (S1).

2. The important difference between
this system and another microprocessor
complex system is that the former does
not need a particular OS.

3. According to the physical construction,
the system is a multi-dimensional
array processor; its memory is
distributed. According to the functions,
or from the view of the user, it is a vector
computer; its memory is common. The user
can program in a vector augmented language
on a unified memory space. The move of
vectors among cells is automatic, and
overlaps with the execution of arithmetic.

4. According to the different requirements
of various users, the number of
cells can be 8, 16, up to thousands or
tens of thousands. The system can be used
alone (with the addition of an I/O peripheral
processor) or can be connected to
another large system. Of course, a cell
can also be used alone (with the addition
of an I/O interface).

5. The system can execute two kinds
of parallel computation: "multi-instruction
stream" and "multi-data stream". It
adopts the principle of virtual common
memory; the efficiency is higher and the
range of applications is wider than for a
conventional array or vector computer
(with the same capacity, same speed and same
number of cells).

6. As a simplified system, the instruction
control unit can be omitted from
all the cells; then the high-level language
may be the same as for conventional vector
computers (such as STAR-100, CRAY-1).

7. A virtual memory system is adoptable.

An example:

Using 1024 cells to construct a
ten-dimensional array ($2^{10}$). The memory
capacity of each cell is 16K words, 32-64
bits per word. The speed of each cell is
1 MIPS, work frequency is 15 MC, 16-bit
transmission with parallel and serial
mode, the maximum moving time of the fetching
process is 1.7~2.6 ms, the peak value of the
system is one billion instructions per
second.

In this system, solving a linear
algebraic equation set of order 4000 with
the elimination method of column main
elements, the efficiency could reach 66
per cent. If appropriate measures are taken,

* Part of this paper was completed
in Nov. 1973, and part of it was completed
in Nov. 1977.

REFERENCES

[1] Gao Qing-Shi, Zhang Xiang, A scheme
of Pipeline Vector Computer of Vertical
and Horizontal Processing.
Inner report, 1978.11 and 1975.7.


An Algorithm of Parallel Processors for Theorem
Proving and Its Applications

Xian Chang Zeng

Institute of Computer Science
Computer Science Department
Wuhan University
Wuhan, Hubei
People's Republic of China


Abstract. An algorithm on parallel processors
is discussed for solving three artificial
intelligence problems. Robinson's resolution
principle in the field of theorem proving is
simplified. A formula is obtained to calculate the
number of universal trails in a digraph. And the
order of a free distributive lattice can be
explicitly expressed by the lengths of given
generating chains. The main results are described in
Theorem R, Theorem E, and Theorem D.


KEY WORDS AND PHRASES: deductive algorithm,
synthetic algorithm, Wuhan Parallel Processor,
resolvent of well-formed formulas, universal trail
on a digraph, the order of a free distributive
lattice generated by chains, the speedup of a
parallel algorithm.


(1) Introduction 


This short paper describes some results obtained
with the Wuhan Parallel Processor (WuPP), which
is intended to speed up digital computing by use
of parallelism at Wuhan University, and has been
studied by our group for almost two years. WuPP
is a system of computers for MIMD parallel
processing intended to support programs which consist of
many independent parallel subroutines [1].


The goal of speeding up computation with WuPP
is restricted to much lower levels of hardware and
software than in many other MIMD machines. At this
low level, parallelism is to be found in virtually
every program, and the software must be rewritten
or reorganized to speed up computation. But the
main subject of multiprocessor research has been
the effort to discover parallel programs which
constitute the independent parts of the principal
computation.


We have tried to solve many problems which
can be programmed with the parallelism described above.
Among the idealized models of parallel
multiprocessing we have found the best ones to be the
following artificial intelligence problems, whose
general solutions cannot be obtained immediately.
We should examine many special cases, and execute
many programs on WuPP; then from the output data
we could find some desired results, and discover
the algorithm for the general solution. We have
considered the following three problems.


(2) The Robinson's problem 


Let $F_1, F_2, \ldots, F_n$ be n given well-formed
formulas, and G be another formula of the first
order logic. If the formula
$F_1 \wedge F_2 \wedge \cdots \wedge F_n \rightarrow G$
is valid, then G is called a consequence of
$F_1, F_2, \ldots, F_n$. The formula
$F_1 \wedge F_2 \wedge \cdots \wedge F_n \rightarrow G$ is
called a theorem, and the formulas $F_1, F_2, \ldots, F_n$
are axioms [2].


The particular formula G is also called the
conclusion of the theorem. A demonstration that
a conclusion follows from axioms is called a
proof. A procedure for finding proofs is called
an algorithm of mechanical theorem proving, in
which a major breakthrough was made by Robinson.
The question of what the algorithm of mechanical
theorem proving is, is called J. A. Robinson's problem.


In the field of theorem proving, Robinson's
resolution principle is well known. According
to this principle, the algorithm of mechanical
theorem proving can be implemented on a digital
computer, and we have programmed such an algorithm
on WuPP. Now we define such an algorithm with
deduction as follows.


As described above, a particular
formula G is called the conclusion of a set F which
contains n given axioms $F_1, F_2, \ldots, F_n$. A deduction
of G from F is a finite sequence $G_1, G_2, \ldots, G_k$
of formulas such that each $G_i$ either is a formula
in F or a resolvent of formulas preceding $G_i$, and
$G_k = G$. A deduction of the empty formula from F is
called the proof procedure of F; that is, the
inconsistency of F is proved. The method used to
get the proof procedure is called the algorithm
for proving F.


We divide such algorithms into two kinds,
according to their usefulness for different problems.


The first is a deductive algorithm, which is used
to deduce the conclusion of a theorem from some
axioms, as many authors have done in the field of
theorem proving [1, 3]. But in our group the
following theorem is always utilized to simplify the
software and the parallel programs executed on
WuPP to speed up computation.


THEOREM R. Let P and Q be two given sets.
If there exist two sets $L_1$, $L_2$ such that $P = L_1 \vee A$,
$Q = L_2 \vee B$, $L_1 = \neg L_2$, then $P \wedge Q$ is a subset of $A \vee B$.


If P, Q are formulas of the first-order logic 
then AV B is the resolvent of P and Q. This 
theorem is equivalent to the Robinson's resolution 
principle, and can be easily proved. We need no 
background in symbolic logic to prove it, only a 
basic knowledge of elementary set theory is enough. 
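
The set-theoretic reading of resolution can be sketched for the propositional case as follows. The representation is an assumption made for the example: a clause is a frozenset of literal strings, and a leading "~" marks negation. Unification of first-order terms is deliberately left out, so this is only a simplified illustration of forming the resolvent A ∨ B from P = L ∨ A and Q = ¬L ∨ B.

```python
def negate(literal):
    """Return the complementary literal ('a' <-> '~a')."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolvents(p, q):
    """All resolvents of two propositional clauses given as frozensets."""
    found = set()
    for literal in p:
        complement = negate(literal)
        if complement in q:
            # Drop the complementary pair and join what is left (A and B).
            found.add(frozenset((p - {literal}) | (q - {complement})))
    return found


P = frozenset({"a", "b"})        # P = a v b
Q = frozenset({"~a", "c"})       # Q = ~a v c
R = resolvents(P, Q)             # one resolvent: b v c

# Resolving a literal against its complement gives the empty clause,
# which signals inconsistency.
EMPTY = resolvents(frozenset({"a"}), frozenset({"~a"}))
```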


Similarly, we can simplify some other theorems in
the field of theorem proving.


The second is a synthetic algorithm, which is
used to get the general result of a problem from
given conditions. Sometimes we shall consider
problems from which we can obtain only partial
data and certain results under special conditions,
but cannot make a precise decision about the
conclusion. For example, how many universal
trails are there in a given directed graph [4], and
what is the order of a free distributive lattice
generated by n given elements [6]? We cannot
immediately get the general solutions of such problems.


We should examine many special cases; sometimes
we need some programs and a lot of computing
work on parallel processors. Then the algorithm
for general solutions might be found by a synthetic
method. We have got some results with WuPP, and
shall state some in the following.


(3) The Euler's problem 


An eulerian trail in a digraph G is a closed
spanning walk in which each arc of G occurs
exactly once. A digraph is eulerian if it has such a
trail [4]. This means an eulerian digraph can be
traversed by such a trail. The number of eulerian
trails of a digraph was obtained by Tutte in the
year 1941 [5].


Suppose a given digraph H can be traversed by at
least t eulerian trails in which each arc of H
occurs exactly once, and these t trails are
arranged in a fixed order. Then every such
arrangement is defined as a universal trail of H. The
question of what the number of universal trails in
a digraph is, is called Euler's problem.


If H is a general digraph, it may not be
eulerian; then the outdegree and indegree of some
vertex may not be equal. Let p be a point
distinct from any vertex v of H; then we can join p
and v with suitable or no directed arcs to make
od(v) = id(v). When v runs over all vertices of
H, we get a new digraph D, which is called the
corresponding graph of H, and this D should be an
eulerian one.


We have examined many special digraphs with
their corresponding graphs, and calculated the
universal trails on WuPP during the past two years.
We have the following:


THEOREM E. Let H be a general digraph. According
to the method described above, we can construct
an eulerian digraph D corresponding to the
given H. Then the number u of universal trails of
H is equal to the number of eulerian trails of D;
it can be calculated by the formula

\[ u = C \cdot \prod_{i=1}^{n} (d_i - 1)! \]

where $d_i = id(v_i)$, C is the common value of the
cofactors of $M_{od}$, and n is the number of vertices
of H [4].

This theorem is a generalization of the result
of Tutte and Harary [4, 5]. Let G be a given
eulerian digraph; if the eulerian property is
preserved, how many ways can be found to orient
its arcs? This is still an open question. But we
have examined some eulerian digraphs on WuPP, and
found that the number w of ways to orient a ge- 
neral digraph H can be calculated by w= 2%, 
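
The count of eulerian trails used in Theorem E, a cofactor of the matrix M_od (out-degrees on the diagonal minus the adjacency matrix) multiplied by the product of (d_i - 1)!, can be sketched for small eulerian digraphs as follows. The arc-list input format and the function names are assumptions; the determinant uses plain cofactor expansion, which is only practical for a handful of vertices, and every vertex is assumed to carry at least one arc.

```python
from math import factorial


def det(matrix):
    """Integer determinant by cofactor expansion along the first row."""
    if not matrix:
        return 1
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for j, a in enumerate(matrix[0]):
        if a:
            minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
            total += (-1) ** j * a * det(minor)
    return total


def eulerian_trail_count(arcs, n):
    """Number of eulerian trails of an eulerian digraph on vertices 0..n-1."""
    outdeg, indeg = [0] * n, [0] * n
    adj = [[0] * n for _ in range(n)]
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
        adj[u][v] += 1
    if outdeg != indeg:
        raise ValueError("digraph is not eulerian: od(v) != id(v) somewhere")
    m_od = [[(outdeg[i] if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]
    # For an eulerian digraph all cofactors of M_od share one value C;
    # here the one obtained by deleting row 0 and column 0 is used.
    c = det([row[1:] for row in m_od[1:]])
    product = 1
    for d in indeg:
        product *= factorial(d - 1)
    return c * product


two_cycle = eulerian_trail_count([(0, 1), (0, 1), (1, 0), (1, 0)], 2)
triangle = eulerian_trail_count([(0, 1), (1, 2), (2, 0)], 3)
```

For the two-vertex digraph with doubled arcs the sketch counts 2 trails, and for the directed triangle it counts 1, which agrees with enumerating the closed walks by hand.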


(4) The Dedekind's problem 


Let $P_1, P_2, \ldots, P_n$ be n given propositions,
which may be used to generate well-formed formulas
with finitely many applications of "conjunction" and
"disjunction", but the application of "negation"
is not permitted. The question of what the number of
non-equivalent formulas generated by n given propositions
with applications of conjunction and disjunction is,
is called Dedekind's problem.


This problem was considered in 1897 by
Dedekind [6]. He stated it as follows: Let $P_1, P_2,
\ldots, P_n$ be n positive integers, where the law of
conjunction means to find the greatest common
divisor, and disjunction means to find the least
common multiple. What is the number of integers
generated by $P_1, P_2, \ldots, P_n$ with finitely many
applications of conjunction and disjunction? This is
the original form of Dedekind's problem.


In the theory of lattices the problem is
stated as follows: Let L be the free
distributive lattice with n generators, and f(n) be the
order of L, i.e. the total number of elements
contained in L including zero and unit. What is
the exact expression of f(n)? This is the form of
Dedekind's problem stated by Birkhoff [7]. We
know that f(1) = 3, f(2) = 6, f(3) = 20,
f(4) = 168, f(5) = 7581 [8], which agree with
Muroga's work and can be calculated by hand.
Using a computer, Ward in 1946 obtained f(6) =
7,828,354 [9].
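
For very small n the values of f(n) can be checked by brute force, using the standard fact that the elements of the free distributive lattice on n generators, together with zero and unit, correspond to the monotone Boolean functions of n variables. The sketch below is an illustrative check under that assumption, not the algorithm run on WuPP; the function name is hypothetical and the search is exponential, so it is only usable up to n = 4.

```python
def lattice_order(n):
    """Count monotone Boolean functions of n variables (constants included)."""
    points = 1 << n                      # number of input assignments
    count = 0
    for table in range(1 << points):     # every truth table as a bit mask
        monotone = True
        for x in range(points):
            if (table >> x) & 1:
                # f(x) = 1 must imply f(y) = 1 whenever y adds one bit to x.
                for i in range(n):
                    y = x | (1 << i)
                    if not (table >> y) & 1:
                        monotone = False
                        break
            if not monotone:
                break
        if monotone:
            count += 1
    return count


small_values = [lattice_order(n) for n in range(1, 5)]
```

The four values produced this way are 3, 6, 20 and 168; larger n need a far more economical method, since the number of truth tables grows as 2 to the power 2^n.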


For about thirty-three years, nobody
could accept or refute Ward's result. Now
we have executed programs on WuPP, and we are
confident the result obtained by Ward is right.
Moreover, we have an algorithm to calculate
f(n) for n greater than 6, which will be
published in another paper.


From the propositional calculus, we know that
$f(n) < 2^{2^n}$; even using a computer it is difficult
to find the explicit expression of f(n). In
the following we shall make a little
generalization by considering the problem in another way.


Let L(k) be the free distributive lattice
generated by the following k chains with lengths
r, s, ..., t respectively:

\[
\begin{aligned}
0 &= x_{10} < x_{11} < x_{12} < \cdots < x_{1r} < x_{1,r+1} = I, \\
0 &= x_{20} < x_{21} < x_{22} < \cdots < x_{2s} < x_{2,s+1} = I, \\
&\;\;\vdots \\
0 &= x_{k0} < x_{k1} < x_{k2} < \cdots < x_{kt} < x_{k,t+1} = I,
\end{aligned}
\]

where 0, I are the zero and unit elements of L(k)
respectively.


For $k \leq 3$, we have completely solved
Dedekind's problem, i.e. we can explicitly express
the order of L(3) in terms of the lengths of the
generating chains. The result is the following:


THEOREM D. Let L(3) be the free distributive
lattice generated by three chains of lengths r, s,
t respectively, and let [r, s, t] be the order
of L(3). Then we have

\[ [r, s, t] = \frac{(r+s+t+2)!!\; r!!\; s!!\; t!!}{(r+s+1)!!\;(s+t+1)!!\;(t+r+1)!!} \]

where $r!! = r!\,(r-1)! \cdots 3!\,2!\,1!$. When r=s=t=2, we
get [2, 2, 2] = 980. When t = 0, we have
[r, s] = (r+s+2)!/((r+1)!(s+1)!), which was
obtained by Birkhoff [7].


When k = 3, it is called the case of 3
variables for Dedekind's problem. We have also
obtained some special results for 4, 5, 6 variables.
For example, we have the following explicit forms:


Db hee gets eae gy 
(s + t)! 


t = “siti? and we have 


Fi, 22) 4 eu) -[3(tt 42 t 7 
t 


2\ 7 
2/t+ 8), 631/t+ 9 
+A ( 9 FX vt )| 
Ls 2,3, tJ=(’Z "\+98(* § °) +2580(*1,” 
t + 10 t+ 12 t +12 
26668 ( 12 ) + 117301 1h }+183958( 6 
\[ [r,s,t]\cdot[r,s,t] = [r,s-1,t+1]\cdot[r,s+1,t-1] + [r-1,s,t]\cdot[r+1,s,t]. \]

The last formula can be used to calculate the
large values of [r, s, t] from the smaller ones
of r, s, t.
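
The closed form of Theorem D can be checked numerically. The sketch below reads the formula as a quotient of products of factorials, with r!! standing for r!(r-1)!...2!1!; this reading is an assumption, made because it reproduces the values quoted in the text ([2, 2, 2] = 980, Birkhoff's two-chain formula for t = 0, and f(3) = 20 for three single-element chains). The function names are hypothetical.

```python
from math import comb, factorial


def superfactorial(r):
    """r!! in the notation of Theorem D: r! (r-1)! ... 2! 1! (1 for r = 0)."""
    result = 1
    for k in range(1, r + 1):
        result *= factorial(k)
    return result


def order_three_chains(r, s, t):
    """[r, s, t]: order of the lattice generated by chains of lengths r, s, t."""
    sf = superfactorial
    numerator = sf(r + s + t + 2) * sf(r) * sf(s) * sf(t)
    denominator = sf(r + s + 1) * sf(s + t + 1) * sf(t + r + 1)
    quotient, remainder = divmod(numerator, denominator)
    if remainder:
        raise ArithmeticError("formula did not give an integer")
    return quotient


cube = order_three_chains(2, 2, 2)
three_generators = order_three_chains(1, 1, 1)

# The product relation between neighbouring values, checked at r = s = t = 1.
lhs = order_three_chains(1, 1, 1) ** 2
rhs = (order_three_chains(1, 0, 2) * order_three_chains(1, 2, 0)
       + order_three_chains(0, 1, 1) * order_three_chains(2, 1, 1))
```

Solving the product relation for one of its factors is what allows larger values of [r, s, t] to be built up from smaller ones without evaluating the factorial products directly.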


We had examined many special cases and executed
a lot of programs on WuPP before we obtained
the formula described in Theorem D. Similarly,
we have got the following explicit form for four
variables:


Nee ere eee a came be 


Since the a's are very large, the programs used to 
calculate them should be parallel, otherwise the 
running time would be very long. We have: 


a, = ly a,=188, ay = 91,68, ay = 201;700 , 

&o= 2353308, ag= 1618598), a, = 71429138, 

@, = 210763120, aq = 2570176, aj, = 585753336, 
a, = Slhh691h, a, = 325903320, 

&yy = 113456486, ay, = 174548. 


If we define the complete set for the polynomial
[1, 1, s, t] as in the next paper [12], then
the complete set of [1, 1, 6, t] contains the
following 28 constants:
1 538 -Th3h9

141s01.070 1:1.73L859 28),:6807080 

3889051876), 383234658030 2840173234855 

1632168185832 7411016837079 273816862566536 

923954320708219 206%26583910286 1.22933032698892) 

7297541416708 3 306 

1054)1.41.7413127500 

12886752737504,718 1275025533945502 


108 340851:2.22h5518 751936936739514 
4.2594,30756992310 1936572050862h7 6891621)6150)12 
185108336363k12 3523700789128 4238527206900 

2):220155),680 


Knuth stated a problem: "Investigate three-
dimensional arrays, in order to see how many of
the properties of two-dimensional Young's tableaux
can be generalized." [10]. We conjecture that if
this problem is solved, then Dedekind's problem
will also be solved.


If we define the speedup of a parallel algorithm
as Lemme and Rice do [11], then Dedekind's
problem in the case of four variables can be
implemented on (q+1)(r+1)(s+1) processors, and
[q, r, s, t] can be calculated.
An Algorithm on Parallel Processing for Theorem Proving
and Solving Dedekind's Problem

ABSTRACT. First, in the field of theorem proving,
we consider how Robinson's resolution principle
can be applied to general resolution.
Second, in the field of graph theory, let H be a
general digraph which can be traversed by at least
t trails. Suppose this property of H is preserved;
in how many ways can H be oriented? This problem is
solved, and Theorem E in the previous paper
can be rigorously proved. Third, in the field of
lattice theory, some properties and more details
about the algorithm for solving Dedekind's problem
are obtained. The complete sets of the polynomials
[1, 2, 3, t], [1, 1, 6, t] and [1, 1, 1, 2, t] are
calculated. Finally, an algorithm for computing
pairs of twin primes is described.


(1) The Robinson's problem 


In the previous paper [1], the algorithm for
solving three artificial intelligence problems was
discussed. Now we describe some new results and
more details of this algorithm, obtained about one
year ago on the Wuhan Parallel Processor (WuPP) with
parallel programs by our group at Wuhan University.


First we consider Theorem R in the previous
paper. Robinson, in an unpublished paper,
wrote the following opinion:

Perhaps Theorem R can be applied to
general resolution by taking the elements
of the sets to be models?
He had given a hint to solve this problem and
said: "See discussion in my book on compactness,
topology, and completeness." [4]


But we have examined many special cases and
executed many programs with WuPP and found that it
is intimately related to the foundations of logic
and lattice theory, so Theorem R may be proved by
the axiomatic method. We will discuss it in another
paper.


(2) The Euler's problem 


Second we consider Euler's problem. Let
H be a given general digraph which can be traversed
by at least t trails. According to the method
described in the previous paper, we can construct
an eulerian digraph D corresponding to H. And we
can prove that the number u of universal trails of H
is equal to that of D. Therefore it can be calculated
by the following formula:


$$\text{(A)}\qquad u = C \cdot \prod_{i=1}^{n} (d_i - 1)!,$$


where the meaning of C, n, d_i can be found in the
previous paper. And we have:


Corollary E. Let H be a given general digraph,
which may not be eulerian. If its property of being
traversed by at least t trails is preserved, in how
many ways can its arcs be oriented? We will
solve this problem. Let w be the number of ways to
orient its arcs; then we can prove the following:


(B)     w = 2u,


where u is the number of universal trails of H.
The rigorous proof of (A) or (B) is a little long,
so we omit it. But it is not difficult to prove (B)
from (A); we just need to consider the definition
of a universal trail of H. If the t trails used to
traverse H are not arranged in a fixed order, then
the equality should be w = t! x 2u.


(3) The Dedekind's problem 


Third we consider Dedekind's problem.
Using a method similar to that described in the
previous paper, we can discuss the case of four
variables for Dedekind's problem as follows.
Let L(4) be the free distributive lattice generated
by chains of length q, r, s, t respectively,
and let [q, r, s, t] be the order of L(4). Then
we have:


Corollary D. (1) The order [q, r, s, t] is
a symmetrical function of the 4 variables q, r, s,
and t. For example [q, r, s, t] = [q, s, r, t] =
[q, r, t, s] = [s, r, t, q].

(2) When any three, say q, r, s, of the four
variables are fixed, then [q, r, s, t] is a
polynomial in the remaining variable t, of degree
(q+1)(r+1)(s+1). For example [1, 1, 1, t] is
a polynomial in t of degree 8.

(3) The polynomial [q, r, s, t] vanishes
when t = -2, -3, -4, -5, ..., -(q+r+s)-2. For
example [1, 1, 1, t] has zeros at t = -2, t = -3,
t = -4, and t = -5.

(4) When t = -1, the value of [q, r, s, t] is
equal to 1, so that [q, r, s, -1] = 1.

(5) When the variable t is replaced by
-(q+r+s+t+4), the value of [q, r, s, t]
is not changed, i.e.

[q, r, s, t] = [q, r, s, -(q+r+s+t+4)].
For example [1, 1, 1, t] = [1, 1, 1, -t-7];
especially [1, 1, 1, 1] = [1, 1, 1, -8] = f(4) = 168.

(6) When one of the variables, say t, is
equal to zero, then this variable can be stricken
out from the function [q, r, s, t]. For example
[q, r, s, 0] = [q, r, s].


It is not difficult to prove the properties
(1)--(6) in the above Corollary D, but the proof is
too long for a short paper, so we omit it.
Using these properties we can easily find the
formula and the algorithm with parallel programs for
computing [q, r, s, t]. For example the degree of the
polynomial [1, 1, 1, t] is equal to 8, so we need
9 values to determine its coefficients. From
Corollary D, we have

[1, 1, 1, 1] = 168,   [1, 1, 1, 0] = 20,
[1, 1, 1, -1] = 1,    [1, 1, 1, -2] = 0,
[1, 1, 1, -3] = 0,    [1, 1, 1, -4] = 0,
[1, 1, 1, -5] = 0,    [1, 1, 1, -6] = 1,
[1, 1, 1, -7] = 20,   [1, 1, 1, -8] = 168.


We already have 10 values of [1, 1, 1, t], so it
can be completely determined. Therefore we have:


$$[1,1,1,t] = \binom{t+2}{1} + 18\binom{t+2}{2} + 111\binom{t+2}{3} + 331\binom{t+2}{4} + 540\binom{t+2}{5} + 495\binom{t+2}{6} + 240\binom{t+2}{7} + 48\binom{t+2}{8}$$

where $\binom{t+2}{k} = \frac{(t+2)!}{k!\,(t+2-k)!}$, which agrees
with the result of the previous paper. The following
set of constants:

1   18   111   331   540   495   240   48

is defined to be the complete set of the polynomial
[1, 1, 1, t]. Similarly we have found that the
complete set of the polynomial [1, 2, 3, t] contains
the following 24 constants:


1 18 


1 1,88 58075 

2857576 7h 73991h 119866,320 
1289252957 987 395302 561):410),803 
2)142211577600 830538700959 22)20686173258 
485790 30087180 8500714724532 120670489307 306 


1387 26071967266 1286809893)7607 951,896375),08 
5586888675375 25197675728000 845021827031 
1961)105018136 29131681235 20071L50),30 


And the complete set of the polynomial [1, 1, 6, t]
contains 28 constants, which can be found in the
previous paper.
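The interpolation argument for [1, 1, 1, t] can be replayed with exact rational arithmetic. The sketch below assumes the ten values that follow from properties (3)-(6) of Corollary D, with [1, 1, 1, 0] = [1, 1, 1] = 20 and [1, 1, 1, 1] = 168; it fits the degree-8 polynomial through nine of them, uses the tenth as a consistency check, and then reads off the coefficients in the basis of binomials (t+2 choose k) by forward differences. The names `KNOWN`, `lagrange_eval`, `order_111` and `complete_set` are illustrative only.

```python
from fractions import Fraction

# Values of [1, 1, 1, t] assumed from Corollary D (keys are t).
KNOWN = {1: 168, 0: 20, -1: 1, -2: 0, -3: 0, -4: 0, -5: 0, -6: 1, -7: 20, -8: 168}


def lagrange_eval(points, x):
    """Evaluate the unique interpolating polynomial through `points` at x, exactly."""
    total = Fraction(0)
    for xi, yi in points.items():
        term = Fraction(yi)
        for xj in points:
            if xj != xi:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total


def order_111(t):
    """[1, 1, 1, t] from nine of the ten known values (degree 8 needs nine points)."""
    nine_points = {k: v for k, v in KNOWN.items() if k != -8}
    return lagrange_eval(nine_points, t)


def complete_set():
    """Coefficients a_k with [1,1,1,t] = sum_k a_k * C(t+2, k), via forward differences."""
    values = [order_111(x - 2) for x in range(9)]   # x = t + 2 = 0 .. 8
    coefficients = []
    while values:
        coefficients.append(values[0])
        values = [b - a for a, b in zip(values, values[1:])]
    return [int(c) for c in coefficients]
```

The held-out point t = -8 is reproduced by `order_111(-8)`, and `complete_set()` yields a leading zero (the k = 0 term) followed by eight constants, which can be compared with the complete set printed in the text.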


One year ago we computed the constant 183957
for the complete set of [1, 1, 3, t]; the running
time was half an hour. But now that we use parallel
programs, it takes about two minutes to get the
number 12886752737504718 for the complete set of
[1, 1, 6, t]. We hope the general formula of
[1, 1, s, t] can be found from the constants
given by this paper. This problem is closely
related to Muroga's works [2].


(4) The twin primes problem


Finally we have examined many special cases
by computer and found an algorithm for computing
the number Z(N) of the pairs of twin primes less
than N. We will describe the method, which is very
similar to the sieve of Eratosthenes:

Let the sequence formed by the natural
numbers ≤ N be the following:

(S)     1, 2, 3, 4, ..., N-1, N.
Then the algorithm for computing the pairs of twin
primes can be defined by 4 steps:
Step (1). All even numbers ≤ N are stricken
out (sieved out) of (S), and let i = 1.

Step (2). Let p_i be the ith odd prime of
(S), p_0 = 1, p_1 = 3, and all proper multiples of p_i
are stricken out from (S).

Step (3). All numbers of the form (k+1)p_i - 2
less than or equal to N are stricken out from (S),
where k runs through all positive integers.

Step (4). When p_i > √N, the algorithm
stops; otherwise i is replaced by the
value of i + 1, and we go to Step (2).


For example, let N = 200; we get the following
sequence:


(T)     3, 5, 11, 17, 29, 41, 59, 71, 101,
107, 137, 149, 179, 191, 197.
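The four steps can be sketched in a few lines. This is one reading of the algorithm, with two assumptions that the text does not settle: the number 1 is dropped from (S) at the start, and the primes p_i are tracked in a separate `composite` table so that a prime such as 7, which Step (3) strikes from (S) because 7 = 3·3 - 2, is still used as a later p_i. The function name `twin_prime_sieve` is illustrative.

```python
def twin_prime_sieve(n):
    """Return the first members p of twin prime pairs (p, p + 2) with p <= n."""
    struck = [False] * (n + 1)      # True means removed from the sequence (S)
    composite = [False] * (n + 1)   # separate primality record used to find each p_i

    # Step (1): strike all even numbers; 1 is also dropped (assumption).
    for m in range(0, n + 1, 2):
        struck[m] = True
    if n >= 1:
        struck[1] = True

    p = 3
    while p <= n:
        if not composite[p]:
            # Step (2): strike all proper multiples of p.
            for m in range(2 * p, n + 1, p):
                struck[m] = True
                composite[m] = True
            # Step (3): strike all numbers (k + 1) * p - 2 with k >= 1.
            for m in range(2 * p - 2, n + 1, p):
                struck[m] = True
            # Step (4): stop once the prime just processed exceeds sqrt(n).
            if p * p > n:
                break
        p += 2

    return [m for m in range(3, n + 1) if not struck[m]]


first_members = twin_prime_sieve(200)
pair_count = len(first_members)   # the quantity the text calls Z(N), for N = 200
```

Note that the check in Step (4) is made after p_i has been processed, so the first prime above √N still takes part in Steps (2) and (3); this matters for numbers p close to N whose p + 2 is the square of that prime.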


Now we need to prove the following two
properties:

(P) A prime is contained in the sequence (T)
if and only if it is the first member of a pair of
twin primes.

(Q) When N tends to infinity, the length of the
sequence (T) also tends to infinity, i.e. there
exist infinitely many pairs of twin primes.


Let p be an odd prime. If p+2 is not a
prime, then it is composite, and p+2 = q·r where
q is a prime less than p. Hence p = q·r - 2, and
r > 1; by the property of Step (3), this prime p
must be stricken out of (S).


When p + 2 = q and q is another prime,
then q - 2 = p is a prime. This p cannot be
written as (k+1)·p_i - 2, otherwise q = (k+1)p_i
would be composite. So by the property of
Step (3), the prime p = q - 2 cannot be stricken
out of (S). Therefore we have completely proved
the property (P).


It is difficult to prove the property (Q).
We believe that the conjecture "There are infinitely
many pairs of twin primes" is true. Moreover
we also believe that the conjecture of Daniel
Shanks:

$$Z(N) \sim 1.3203236 \int_2^N \frac{dn}{(\ln n)^2}$$

is true [3]. But the details are intimately
related to Dirichlet's Theorem: "Every arithmetic
progression an+b, where a, b are relatively
prime integers and n runs through all positive
integers, contains infinitely many primes." We
would like to discuss this problem in another paper.
ABSTRACT


Aspects of the design and implementation of
CSP/80, a language based on Hoare's communicating
sequential processes, are discussed. The goal of
the design has been to stay as close to Hoare's
original notation as possible. The goal of the
implementation has been to reduce the amount of
reinvention by making utmost use of facilities
provided by the operating system (UNIX).
This has shortened the implementation time
considerably. CSP/80 is to be used for
evaluating CSP as a programming language
for distributed processing applications.


CH1569-3/80/0000-0173$00.75 © 1980 IEEE
INTRODUCTION


In "Communicating Sequential Processes"
[1], Hoare proposed an elegant notation for
programming distributed systems. The notation,
hereafter called CSP, combines
Dijkstra's nondeterministic control structures
for sequential programming [2] with
"blocked" input/output for communication
and synchronization between parallel processes.
In order to evaluate CSP's utility
as a concurrent programming language, we
have undertaken to produce a prototype
implementation. Our final goal is to apply
the language in programming several distributed
systems. This paper presents the
design and implementation of our version of
CSP, called CSP/80. We discuss and motivate
the important design decisions and the
deviations from Hoare's version. As Hoare
observed in his paper [1, p. 667], his
notation "should not be regarded as suitable
for use as a programming language...".
The work reported in this paper has been
aimed at producing a suitable programming
language based on Hoare's notation.


CSP is one of several recent proposals 
for distributed processing. As Brinch Han- 
sen points out in [3], however, these pro- 
posals must be evaluated based on their use 
in practice. In order to do this experi- 
mental evaluation, one needs an implementa- 
tion of the concept. Because the implemen- 
tation is a vehicle for evaluation of an 
untried approach, the implementation time 
must be kept small. Therefore, we have 
tried wherever possible to use already 
existing facilities. 


We have tried to include in CSP/80 two
methodological ideas that have been found 
to be valuable in designing other types of 
programs. These are modular programming 
and strong typing. The idea of modular 
programming is that it should be possible 
to design the different modules of a pro- 
gram independently. The requirement in CSP 
that a process must name all the processes
it communicates with violates this rule. We 
have tried to remedy this by introducing 
ports and channels. Ports and channels 
have also allowed us to turn CSP into a 
strongly typed language (i.e. all type 
checking can be done at compile time). 
This is a very important characteristic of 
a language that helps in the development of 
reliable software. 


In section 2 we briefly review the lan- 
guage concepts of CSP. In section 3 we 
give the differences between CSP and 
CSP/80. We do not go into great detail
about the reasons for these differences; 
these are covered in [4]. In section 4 we 
discuss the implementation of the language 
and the important design decisions made.
Section 5 concludes the paper. 
COMMUNICATING SEQUENTIAL PROCESSES


A program in CSP consists of a fixed number
of parallel processes (which may run on
distinct processors). Each process consists
of a series of sequential statements.
Statements are provided for assignment,
alternation, repetition, input and output.
The assignment statement is similar to that
in other languages. The alternation and
repetition statements are based on Dijkstra
[2]. Input and output are the only really
novel concepts provided by the notation.
An input (correspondingly, output) statement
names a process from which (to which)
the input (output) is to be received
(sent). Upon execution of an input (output)
command, the process is suspended
until a corresponding output (input) is
performed by the named process. At that
point, the input/output transaction takes
place and both processes continue execution.
The I/O commands thus provide for
both communication and synchronization
between processes. Furthermore, Hoare allows
the use of input commands in the guards of
the alternative and repetitive statements.
Such a guard is selected only if the partner
process has already committed (i.e.
been suspended due to having requested an
output to this process).


In the next section we discuss where, 
how and why we have deviated from Hoare's
notation. 


CSP VS. CSP/80 


Program structure 


A program in CSP/80 consists of a fixed
number of (separately compiled) parallel
processes, and a list of channel declarations.
Each process may have one or more
input or output ports through which it
communicates with other processes. A channel
declaration establishes a link between a
port in one process and a port in another
process.


As an example, the bounded buffer example
of Hoare [1, p. 673] written in CSP/80
is shown in Fig. 1. The complete syntax of
CSP/80 is given in Appendix A.


process produce 
output int Y; 
int s; 


end process 


=o 3 
=» 


process consume
    input int Z;
    int s;
    int sum;
    sum = 0;
    *[ 1 -> ?s = Z;
            sum = sum + s;
     ]
end process


process X ::
    guarded input int Y;
    guarded output int Z;
    int in;
    int out;
    int buffer[9];
    in = 0;
    out = 0;
    *[ in < out + 10;
       ?buffer[in%10] = Y
           -> in = in + 1;
     [] out < in;
       !Z = buffer[out%10]
           -> out = out + 1;
     ]
end process


/*buffered version*/
int channel from produce.Y
            to X.Y
int channel from X.Z
            to consume.Z


/*unbuffered version*/
int channel from produce.Y
            to consume.Z


Note: "%" is the modulo operator.


Figure 1: produce and consume in two
configurations
Inter-process communication and synchronization

In order to send information from process
P to process Q, P must have an output
port, x, and Q an input port, y. These
ports must be linked by a channel, declared:

    <type> channel from P.x to Q.y

<type> is the data type of the information
being transferred and must be the same as
the types associated with x and y. The
actual transfer takes place after both of
the following have taken place (in either
order):

• x has been used as the target of an
  assignment statement in P; (!x = expression;)

• y has been used as the source of an
  assignment statement in Q. (?variable = y;)
The way input/output is done is different
from CSP, where communicating processes
must name one another and a type mismatch
is caught only at run-time. The use of
typed ports in CSP/80 allows type checking
to be done at compile time. The use of a
channel allows a process to be written
without explicit knowledge of the name(s)
of the communicating partner(s). This
allows a process to be connected to different
processes without recompiling the process.
The connection is performed by a
"linker". The ability to reconfigure the
system without recompilation is an attractive
capability in a distributed system.
Figure 1 shows two possible ways that the
same producer and consumer may be connected.


The use of typed ports also removes the
need for one of Hoare's constructs.
Instead of the special signals, e.g.
has(n), we can define a port by the mnemonic
name, e.g. has. Any I/O through this
port then has the meaning of has(n).


Alternative and repetitive commands 


CSP/80 is identical to CSP in this res- 
pect except that we allow output commands 
to appear in guards as well. If a port 
name is to appear in guards, however, the
port declaration must declare the port as 
guarded. This is to enable the detection 
of the anomalous situation where two communicating
processes both have their I/O
statements in guards. This situation,
which is a form of deadlock, was the reason 
Hoare ruled out the possibility of output 
commands in guards. With our solution, we 
allow more freedom and still provide a mea- 
sure of protection. This issue is dis- 
cussed at length in [3,4]. 


Arrays of processes and channels


Just as in CSP, CSP/80 allows the use of
arrays of processes. The effect of the
following process declaration:

    process P (i:0..9) :: ....... end process

is the same as having ten processes called
P(0), P(1), ..., P(9). The occurrence of the
bound variable i in P is replaced by 0 in
P(0), 1 in P(1), etc. Any variables and
ports declared in P are local and therefore
no confusion can exist in channel declarations.
For example,

    <type> channel from P(1).x to P(2).y

We also allow channel declarations of the
form:

    <type> channel (i: 0..9) from P(i).x to
                   P(i+1 mod 10).y

Both arrays of processes and channels are
merely shorthand notations and do not add
any power to the language. They can be
regarded as a primitive macro processing
capability.
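The macro-like nature of a channel array can be illustrated by expanding one declaration into its individual channels. The sketch below is not part of CSP/80; `expand_channel_array` is a hypothetical helper, and the endpoint expressions are passed as Python functions rather than parsed from CSP/80 source.

```python
def expand_channel_array(ctype, low, high, source, target):
    """Expand `<type> channel (i: low..high) from source(i) to target(i)`
    into one plain channel declaration per value of the bound variable i."""
    return [
        f"{ctype} channel from {source(i)} to {target(i)}"
        for i in range(low, high + 1)
    ]


# The ring from the text: P(i).x is linked to P(i+1 mod 10).y for i = 0..9.
ring = expand_channel_array(
    "int", 0, 9,
    source=lambda i: f"P({i}).x",
    target=lambda i: f"P({(i + 1) % 10}).y",
)
```

Each element of `ring` is an ordinary single-channel declaration, which is the sense in which the array form adds no expressive power.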
IMPLEMENTATION 
Goals 
The language is implemented on a
PDP11/45 in the programming language C
under the UNIX [6] operating system. The
overriding concern in the implementation
has been to limit the time and effort
required for implementation and still provide
us with programming experience in CSP.
Thus, all features have been restricted in
such a way as to produce a usable language
and also allow for future expansion of the
system. For example, currently only (scalar
and array) integer and character data
types are supported. It is relatively
straightforward to add other data types to
the language and the current language is
sufficiently powerful for investigating
distributed processes.


Another example is the implementation of
the nondeterministic control structures.
We have made no effort to ensure that the
selection of guards is completely random.
Our goal was not to investigate nondeterminism.
Furthermore, our implementation
does meet Dijkstra's requirements for an
implementation of the guarded commands [2].


System structure

The CSP/80 system consists of a translator
and a linker. The translator accepts one
CSP/80 process and translates it into a C
program (which will run as a UNIX process).
The linker accepts the names of some CSP
processes (already compiled by the C compiler)
and a list of channel declarations.
It produces an executable CSP/80 program
consisting of the processes communicating
via the channels. Both the translator and
the linker were written using LEX and YACC,
the translator writing tools available
under UNIX.
Run-time organization

At execution-time, a CSP program consists
of a set of UNIX processes, each
representing a CSP process, and another
UNIX process called the monitor, which
coordinates the communication between processes.
There is also a direct access
file, called the channel file, which contains
the required channel buffers. A UNIX
pipe connects all processes with the
monitor. This organization is shown in
Figure 2.


A simple output command from process P
to process Q is implemented in the following
way: P writes its output in the location
in the channel file reserved for output
on the appropriate channel. It then
sends a message along the pipe to the monitor
indicating that it requires an output
service. P then puts itself to sleep (by
invoking pause, a UNIX primitive). An
input command is implemented in a similar
way.


The monitor constantly reads its request
pipe, responding as soon as a request
arrives. The pipe mechanism provides for
the queuing of messages that arrive while
another one is being serviced. If the
request is an input (or output), the monitor
sets the status of the requesting process
as committed to perform input (or output).
It then checks to see whether the
other partner is committed. If it is, then
the output data is copied from the output
to the input section of the appropriate
channel buffer (in the channel file). Note
that channels and ports are implemented
simply by a location within the channel
file that the two communicating processes
use to write into and read from. The
linker assigns the address of this location
to the partner processes. Thus channels
and ports are conceptual tools for specifying
the communication, and they allow
strong type checking and can be implemented
efficiently in a uniprocessor.
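The commit-and-match behaviour of the monitor can be sketched as a small single-threaded simulation. This is an illustration, not the CSP/80 implementation (which is written in C and uses a UNIX pipe and a channel file); the class names `Channel` and `Monitor`, the `woken` list, and the decision to wake both partners after a transfer are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """One channel buffer: an output section, an input section, and commit flags."""
    out_slot: object = None
    in_slot: object = None
    sender_committed: bool = False
    receiver_committed: bool = False


class Monitor:
    """Services input/output requests one at a time, as if read from the request pipe."""

    def __init__(self):
        self.channels = {}   # channel name -> (Channel, sender name, receiver name)
        self.woken = []      # processes woken so far, in order (assumed behaviour)

    def link(self, name, sender, receiver):
        self.channels[name] = (Channel(), sender, receiver)

    def request(self, name, kind, value=None):
        """Record a commitment; return True if the transfer took place."""
        channel, sender, receiver = self.channels[name]
        if kind == "output":
            channel.out_slot = value          # the sender wrote its data before asking
            channel.sender_committed = True
        elif kind == "input":
            channel.receiver_committed = True
        else:
            raise ValueError(f"unknown request kind: {kind}")

        if channel.sender_committed and channel.receiver_committed:
            channel.in_slot = channel.out_slot            # copy output section to input section
            channel.sender_committed = False
            channel.receiver_committed = False
            self.woken.extend([sender, receiver])
            return True
        return False


monitor = Monitor()
monitor.link("c", sender="P", receiver="Q")
first = monitor.request("c", "output", 42)   # P commits first and must wait
second = monitor.request("c", "input")       # Q commits; the transfer happens now
```

The order of the two requests does not matter: whichever partner commits second triggers the copy, which mirrors the "in either order" rule for channel transfers.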


The more interesting interaction between 
the processes occurs when an I/O command 
appears in a guard. According to the
semantics of the construct, this guard can 
be chosen only if the partner process has 
already committed. Furthermore, there may 
exist several guards with I/O commands in 
them. This is implemented by the process 
sending an activity check request to the 
monitor and going to sleep. This request 
asks the monitor to wake up the process if 
and when any of the process's partners 
either commit to an I/O operation with this 
process or terminate. Upon waking up, if 
the commitment affects any of the guards, 
then that guard is selected. Otherwise, 
the process issues another activity check 
request and goes to sleep.


A final interaction between the processes
and the monitor occurs when a process,
just before terminating, sends a message
to that effect to the monitor. The
monitor records that information, which will
be of use to the process's partners.


Guard selection and fairness


The implementation of the guarded commands
is a compromise between polled (busy
waiting) and interrupt driven processing.
In the polled case, the process would constantly
test the guards until one of them
became true (as a result of a partner process
making a commitment). This might be
acceptable if processes were indeed allocated
to distinct processors. On a uniprocessor,
on the other hand, this is disastrous
because this testing of the guards
would itself use up some if not all of the
time otherwise available to other processes
waiting to do useful work (including the
commitment necessary for the resumption of
the waiting process).


In the interrupt-driven case, the process
would go to sleep after asking to be
woken up when any of the processes it is
waiting on makes a commitment. It would
only be awakened when a guard can be
selected. This is more efficient of processor
time than polling but is more complicated
to implement.


In our implementation, the process goes
to sleep and is awakened if any of its
partners has an activity (commitment or
termination). It is possible that the
activity does not affect any of the guards,
in which case the process goes back to
sleep after testing all its guards. Thus
the process is not always busy waiting,
just sometimes. This implementation is
much less complicated than the pure
interrupt-driven case and is more efficient of
processor time than pure polling.


For simplicity, our implementation tests
the guards in purely sequential order. If
the first guard is always true, therefore,
the other guards will never be selected.
Although this is unfair and could lead to
starvation of certain processes, such
scheduling policies are not ruled out by the
language semantics. It is the responsibility
of the programmer to write programs
that do not rely on specific scheduling
policies.
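The starvation effect of purely sequential guard testing is easy to reproduce. The following sketch models guards as Python callables returning True when the guard can be selected; `select_guard` is a hypothetical stand-in for the generated guard-testing code, not the actual CSP/80 run-time routine.

```python
def select_guard(guards):
    """Test guards in textual order and return the index of the first one that holds,
    or None if no guard can be selected."""
    for index, is_ready in enumerate(guards):
        if is_ready():
            return index
    return None


# Two guards that are both always selectable: the second one is never chosen.
selection_counts = [0, 0]
for _ in range(100):
    chosen = select_guard([lambda: True, lambda: True])
    selection_counts[chosen] += 1
```

After the loop every selection has gone to the first guard. A fairer policy would have to start the scan at a different position each time or choose among the ready guards at random.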


Although the current version is adequate
for our purpose of the evaluation of CSP as
a programming language, we envision expanding
the system capabilities a great deal.
First, we intend to enhance CSP/80 for
multiprocessor operation. Design efforts are
underway for a two processor version to be
implemented on VAX 11/780 computers. With
this implementation meaningful benchmarks
can be run. In particular, we would like
to measure how much time a process spends
waiting for its partner to commit and how
much of this time could be saved if
non-blocked I/O were used. We would also like
to enhance CSP/80 to:


• support more data types,

• provide simple deadlock detection and
  handling,

• use a fairer guard selection algorithm.


CONCLUSIONS 


We have described the design and implementation
of CSP/80, an implementation of
Hoare's communicating sequential processes.
As far as we know, this is the first such
implementation. The implementation is
heavily based on and uses facilities provided
by UNIX (and C) to minimize implementation
time. Although this restricts our
ability to exert control over process
scheduling, it has resulted in a quick
implementation.
APPENDIX A

Below is the modified BNF description of
CSP/80. What is shown has been extracted
from the actual input to LEX, a lexical
analyzer, and YACC, a parser generator,
both available under UNIX. Nonterminals
are in lower case; terminals are in upper-
case. The metacharacters are:

    :    "produces"
    |    or
    ;    end of a production

The meaning of the terminals is shown in
the table following the productions.

command : SKIP SEMICOL
        | expn SEMICOL
        | io SEMICOL
        | alt
        | PASSTHROUGH
        | error
        ;
alt     : choice
          alterns
          RBRA
        ;
choice  : REP
        | LBRA
        ;
alterns : altern
        | alterns BOX altern
        ;
altern  : guard
          ARROW decls stmnts
        ;
guard : bool 
process : PROCESS IDENT decls 
range i bool 
DBLCOL portdec SEMICOL 
decls decls io 
stmnts 
END PROCESS decls io 
, 
range : /* empty */ bool : 
| LPAREN IDENT expn 
COLON NUM | bool SEMICOL 
RANGE NUM expn 
RPAREN : 
; expn : NUM 
portdec : /* empty */ | STRING 
; portdec guarded INPUT | QUOTE 
PORT TYPE | IDENT 
dim /* element size */ sub 
IDENT | LPAREN 
dim /* number of ports */ expn 
SEMICOL RPAREN 
: i e@xpn op expn 
decls >: /*® empty */ i; e@xpn op 
| decls decl SEMICOL | Op expn 
; 
decl : TYPE sub : /* empty */ 
IDENT i LBRA 
dim expn 
; RBRA 
dim : /* no bound => Scalar */ 
| | LBRA op : OP 
NUM i EQUALS 
RBRA : 
; io >: QUERY 
guarded : GUARDED target 
i /* empty */ EQUALS port 
; i EXCLAM 
stmnts : command port 
: stmnts EQUALS expn 
command . = 


target : IDENT 
tsub 


w@e 
tsub


port 


psub 


/* empty */ 
LBRA 
expn 
RBRA 


IDENT 
psub 


/* empty */ 
LPAREN 

expn 

RPAREN 
ARROW
BOX 
COLON 
DBLCOL 
END | 
EQUALS 
EXCLAM 
GUARDED 
IDENT 
INPUT 
LPAREN 
LBRA 
NUM 

OP 


PASSTHROUGH 
PORT 


RPAREN 
SEMICOL 
SL 
STRING 


se aa 


Uy end" 

WT tf 

we 

"guarded" 

C identifier 

" input" 

"(tt 

a 

unsigned integer 

wi~w he uy n/t naw 
9 ? ? 


Wen wy tk TF "oat 
g i] ] 


of 
eae 


a line with "#" in column 1 


vw por cu 
HO if 


a Single character delimited 


by single quotes 
ih} 


Lee 

mje 

wet u 

"yt 

ey 

tskip" 

a string delimited 
quotes 

"int" 


by double 


A COMPREHENSIVE FRAMEWORK FOR EVALUATING 
DECENTRALIZED CONTROL 


John A. Stankovic 
Department of Electrical and Computer Engineering 
University of Massachusetts 


Amherst, Massachusetts 


Abstract -- Effective decentralized control 
algorithms will help achieve many of the poten- 
tial advantages of highly cooperative distributed 
systems. Currently, there is no unified approach 
for developing and analyzing decentralized con- 
trol algorithms. This paper describes a compre- 
hensive framework that can serve such a purpose. 
Highlighted in the framework are the underlying 
principles of distributed systems and the need 
for effective evaluative techniques. A partial 
example of the application of the framework to a 
simple decentralized job scheduling algorithm is 
also presented. 

1. Introduction 

The dramatic reduction in computer costs 
coupled with the potential advantages of connect- 
ing computers in a network makes distributed 
processing systems inevitable. These potential 
advantages include increased resource sharing, 
better performance, higher reliability and easier 
extensibility than possible with uniprocessors. 
However, current distributed systems achieve 
these advantages in a very limited manner due to 
the multitude of "new" problems that distribution 
causes. Foremost among these problems are the 
high cost and critical nature of centralized con- 
trol. These two issues must be resolved before 
the potential advantages of distributed processing
can be realized to a large degree.


This paper describes a comprehensive frame- 
work for decentralized control. The framework 
aids the development and analysis of decentral- 
ized control algorithms. These algorithms can 
then be used to eliminate the high cost and cri- 
tical nature of centralized control. Although 
there is a wide spectrum of distributed systems, 
the framework concentrates on one specific type 
of distributed processing system that is still 
in the early research stage. Specifically, it 
addresses distributed systems characterized by 
decentralized system-wide control of resources 
for the cooperative execution of application 
programs. By decentralized system-wide control 
we mean that overall executive control is exer- 
cised through the cooperation of decentralized 
system elements to form a single organism [21]. 
For the sake of brevity the term "system-wide" 
will be dropped when speaking of decentralized 
control in this paper. The proposed framework 
is applicable to the decentralized control of 
any function that must operate with incomplete 
or inconsistent data and under strict time re- 
quirements. Such functions might include rout- 
ing, scheduling and resource allocation. 
The framework is described in section 2. As
an example, a new decentralized job scheduling 
algorithm is partially described and evaluated by 
means of the framework in section 3. Finally, 
the usefulness, potential and limitations of the 
framework are summarized in section 4. 

2. A Framework for Decentralized Control 

Currently, there is no unified approach for 
developing and analyzing decentralized control 
algorithms. In this section a comprehensive 
framework that can serve such a purpose is devel-
oped. First, the minimum requirements of the
framework are stated, and then the framework it- 
self is described. Since this field of research 
is in its infancy, the philosophy behind the 
development of the framework is to allow for easy 
extensibility and modifiability as new fundamental 
principles of distributed systems are discovered. 


Requirements 


The minimum requirements of the decentralized 
control framework are: 


1) to address the central issues of decentralized
   control including,

   a) concurrency,

   b) operation in the presence of missing,
      incomplete or erroneous state information,

   c) uniqueness in time and space principle
      (see 2.2), and

   d) cost (overhead) of the algorithms,

2) to enable meaningful evaluation of the
   decentralized control algorithms,

3) to provide a convenient structure for the
   development and comparison of new algorithms,

4) to be generalizable to all functions, and

5) to allow for the incorporation of new research
   results.

The Framework 


In the development of the framework, a distributed
system is viewed as a collection of functions,
$\{f_i\}$, where each function $f_i$ of the system
must be controlled by a decentralized system-wide
control algorithm, $X_i$, which utilizes the set of
state information $\{Y_i\}$ to achieve a set of
goals $\{Z_i\}$. As an example, the functions $f_i$
might include routing, message communication,
scheduling, resource allocation, data management,
and distributed applications. Then for each $f_i$
we develop algorithms $X_{i1}, X_{i2}, X_{i3},
\ldots, X_{ir}$, where each $X_{ij}$ has a different
set of state information $\{Y_{ij}\}$.

The set of goals (requirements) $\{Z_i\}$ is chosen
by the designers depending on the function $f_i$
and the application. A formal specification of the
requirements is necessary to fully evaluate the
solution. Since formal specification is an
active research area, the current framework
permits an informal specification of the goals
(requirements) as the set $\{Z_i\}$.


The intent of this approach is that for a
given $f_i$, various decentralized control
algorithms $X_{ij}$ can be developed and compared
as a function of the state information and design
goals (informally stated at this time). Important
research questions include the amount of state
information required to properly control function
$f_i$, how that information is accessed, and how
many distributed entities decide on the resultant
control choices. The formulation of the problem in
these terms is necessary, although not sufficient,
to meet requirements 2, 3, and 4 of the framework.


The next step in the development of the 
framework is to incorporate the central issues of 
decentralized control known at this time. 


Concurrency. Each algorithm $X_{ij}$ is
implemented by multiple entities $e_1, e_2,
\ldots, e_n$ running asynchronously and acting
together but without central control or data.


Operation in the presence of missing, incomplete
or erroneous state information. A fundamental
characteristic of distributed systems is the
long and unpredictable delays experienced in
interprocess communication giving rise to missing
or incomplete state information. This implies a
great need for algorithms that can effectively 
operate under these conditions. In general, dis- 
tributed systems will also experience greater 
probability of errors (hardware and software) 

than uniprocessors giving rise to the greater need 
for resilient algorithms. 


Uniqueness in Time and Space Principle. One
central principle confronting the design of
decentralized control algorithms involves dealing
with the absence of uniqueness both in time and
space [26]. This characteristic of distributed
systems implies that the multiple, decentralized
entities implementing the control algorithm get 
either a partial and coherent (i.e., observations 
are made at the same moment in the system--univer- 
sal time) view, or a complete but incoherent view 
of the system. The consequences of this charac- 
teristic are not fully understood, but it is a 
critical distinction between centralized and 
decentralized systems.

The framework addresses the "uniqueness of
time and space" issue by viewing five dimensions
of decentralized control.


1) Global Environment - this is the sum
total of all the local environments at one instant
of universal time. For decentralized control
algorithms it will be impossible to know the 
global state accurately. Yet, some information 
about the global environment is necessary for 
system-wide control algorithms. 


2) Local Environment - the state of the 
machine on which this entity is executing. Some 
subset of the information about the local environ- 
ment will also be used by the algorithm. In most 
systems this information is assumed to be correct 
and timely. There are no special assumptions 
about the correctness of the local data in this 
framework. 


3) Algorithm - The algorithm's logic obvi- 
ously plays an important role in the effectiveness 
of the decentralized control. The algorithm's 
logic cannot assume that it knows or can construct 
the absolute chronological ordering of events, nor 
that the set of entities implementing the algo- 
rithm perceive identically the set of events in 
the system. 


4) Data - This is the actual information
about the global and local environment used by
the algorithm. The data may be missing,
incomplete or in error.


5) Time - There is no universal time refer- 
ence. 


In decomposing the "uniqueness of time and
space" principle into these five components, we
believe that the implications of this principle
for decentralized control algorithms can be
better understood and algorithms being developed
will better address the important aspects of the 
problem. The example of the next section should 
help clarify this point. 


Overhead of the Algorithms 


In order to address the overhead of algorithms
issue, the algorithms, $X_{ij}$, must meet the
following conditions:

a) execute to meet strict time requirements,

b) be decentralized (i.e., $X_{ij}$ will be
   implemented by multiple entities acting
   together but without central control or
   data),

c) be able to operate with uncertain, missing
   or erroneous data, and

d) require no more than a specified amount
   of memory.


The evaluation of decentralized control
algorithms will consist of two parts. The first is
an absolute evaluation that determines if the
algorithm meets its requirements. The second is a
comparative evaluation of different algorithms
that meet the requirements. At a minimum the
decentralized control framework requires that the
following parameters be part of the evaluation:


o performance (e.g. response time and
  throughput),

o logical correctness (absence of deadlocks,
  cycles, etc.),

o resiliency (capable of operating in the
  presence of failures as well as recovering
  from failures),

o overhead (execution time, memory, and
  communication costs),

o stability (presence of an anomaly should
  not have chaotic effects),

o fairness,

o extensibility (the algorithm should easily
  control additional resources of the same
  type),

o cost and difficulty of initialization, and

o understandability.


Although it is not possible (to date) to quantify 
many of these parameters, the choice of a practi- 
cal algorithm should take all of these parameters 
into consideration. In general, the measurements 
of these parameters for decentralized control algo- 
rithms are open research questions. As part of 
the extensibility of the framework 1) new para- 
meters may be added to this list, and 2) new 
techniques for evaluation of these parameters can 
replace techniques shown to be inferior. Current- 
ly, the appropriateness of different mathematical 
techniques (mathematical programming, dynamic 
programming, game theory, decision theory under 
uncertainty, and decentralized control theory) 

for use in the evaluation of the performance para-
meter is under investigation. Presently, decision
theory under uncertainty utilizing a Bayesian
decision strategy seems promising.


The elements of this decentralized control 
framework are meant to be general for all func- 
tions. Individual functions may have certain 
specialized characteristics that require exten- 
sions of this basic framework to deal with these 
characteristics. 


3. Decentralized Control of Job Scheduling 


This section provides an example of how the 
proposed framework for decentralized control 
might be applied to the development and evaluation 
of a new decentralized control algorithm for the 
function of job scheduling. For the sake of
brevity some issues of the framework are summarily 
dismissed. In practice, every issue of the frame- 
work would be addressed in detail. A network of 
seven hosts configured as in Figure 1 is assumed 
for this discussion. 


Using terminology of the framework, the
function $f_i$ to be addressed is job scheduling.
There exist seven entities $e_1, e_2, \ldots, e_7$
that taken together implement $f_i$. These
entities execute on hosts 1, 2, ..., 7
respectively. The local state information at host
i is the length of the queue of jobs waiting to
enter the system at host i. The global state
information used at host i is host i's perception
of the queue lengths at the other host locations.
Note that the framework allows iterative changes
to these state information quantities for direct
comparison.

The primary goal of the algorithm is assumed to be
a high throughput of jobs. In practice the
requirements on throughput would be more precise
and the requirements on all other parameters
listed in section 2 would also be addressed.


Briefly, the intent of this decentralized
control algorithm is to perform load balancing at
the job level. Periodically, each entity, $e_i$,
updates its Workload Table, sends load estimates
to its neighbors, and performs scheduling which
might include movement of jobs to other hosts (a).
These factors are part of the overhead costs of
the algorithm.


Figure 2 conceptually illustrates the Work-
load Tables that are maintained at each host (in
actuality not all columns need be retained be-
tween updates, implying a low memory overhead).
Table i exists at host i. The first column of 
each table i is host i's view of the system. The 
additional columns in table i correspond to the 
nearest neighbors (defined as having a direct 
physical interconnection) of host i, e.g. in 
Figure 1 host 1 has nearest neighbors 2 and 4. 
Hence TABLE-1 of Figure 2 has 3 columns labelled 
1, 2 and 4. Conceptually, these additional
columns of the table are host i's perception of
its nearest neighbors' view of the system.


The actual values in the table are workload 
estimates calculated based on the state informa- 
tion chosen as part of the algorithm. In this 
example, this is simply the number of jobs in a 
queue. In general, host i can determine precisely
the number of jobs in its own queue (accurate
local data) and therefore will believe its own
estimate rather than its neighbors' perception of
its workload. These values are the boxes marked
with vertical lines in Figure 2. Since nearest
neighbors of host i are only 1 step away, their
estimates of their workload as passed to host i
will be only slightly out of date and in general
be a better estimate than estimates other nodes
have of them. Therefore, host i will assign a
higher probability of correctness to nearest
neighbors' estimates of themselves (boxes marked
with horizontal lines). All other estimates are
grouped into a third probability category. If 
the precise configuration of the network is 
known, weights could be assigned to the estimates 
proportional to the distance from host i. In the 
third probability category, host i determines the
workload by computing an average of the columns
of nearest neighbors. Using the average is an
arbitrary choice at this time. Only after a
proper evaluative comparison will the choice of
an average be substantiated.

(a) Not addressed in this paper are implementation
    issues of job movement, such as data
    translation if non-homogeneous hosts are
    involved.


Each table is periodically updated by a host 
using messages from its nearest neighbors. For 
example, host 1 receives messages from host 2 and 
host 4 containing their view of the system (i.e., 
their column vectors). This is global data. 

Host 1 then recalculates column 1. To do this 
host 1 looks at its job queue to obtain the number 
of jobs waiting at host 1 and places this number 
in the first column, first row. It then takes 
host 2's view of 2 and places this number in the 
first column, second row, then host 4's view of 4 
is placed in the first column, fourth row. All other
entries in the first column are calculated by tak- 
ing an average of host 2 and 4's perception of 
other hosts. In general there may be more than 
two columns (see Table 2). 
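
The column recalculation just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `update_own_column`, the dictionary representation of a column vector, and the sample queue lengths are hypothetical and do not come from the paper; the three trust categories follow the text.

```python
def update_own_column(host, local_queue_len, neighbor_columns, n_hosts):
    """Recompute `host`'s own view (its first column) of system load.

    neighbor_columns maps each nearest neighbor's id to the column
    vector (dict: host id -> estimated queue length) last received
    from that neighbor. Hosts are numbered 1..n_hosts.
    """
    column = {}
    for h in range(1, n_hosts + 1):
        if h == host:
            # Category 1: accurate local data, so trust our own queue.
            column[h] = local_queue_len
        elif h in neighbor_columns:
            # Category 2: a nearest neighbor's estimate of itself is
            # only one step out of date.
            column[h] = neighbor_columns[h][h]
        else:
            # Category 3: average the neighbors' perceptions of h.
            estimates = [col[h] for col in neighbor_columns.values()]
            column[h] = sum(estimates) / len(estimates)
    return column


# Host 1 with nearest neighbors 2 and 4 (queue lengths are made up).
col2 = {1: 5, 2: 3, 3: 4, 4: 2, 5: 6, 6: 0, 7: 2}
col4 = {1: 4, 2: 1, 3: 2, 4: 7, 5: 2, 6: 4, 7: 2}
view_of_host_1 = update_own_column(1, 6, {2: col2, 4: col4}, 7)
```

The plain average in the third category mirrors the paper's admittedly arbitrary choice; a distance-weighted scheme would change only that branch.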


Periodically a scheduling decision must be 
made. If host j is substantially less busy than 
host i then some number of jobs will be moved to 
j from i. Both the substantial difference para-
meter and the number of jobs to move are important
variables. In this algorithm a substantial
difference is chosen to be 3 jobs and [expression
illegible in source] jobs are moved assuming this
calculation results in a positive number. The jobs moved are taken
from the back of the queue to account for some 
degree of fairness. At this point the algorithm 
is completely ad hoc. A substantial evaluation 
is required before we attest to the usefulness of 
this algorithm. 
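
A minimal Python sketch of such a scheduling decision is given below. The threshold of 3 jobs and the removal of jobs from the back of the queue come from the text. Everything else is an assumption made for illustration: the name `schedule`, the choice of the least busy host as the target, treating "substantial" as "at least 3", and in particular the rule of moving half the difference, which stands in for the paper's own expression.

```python
from collections import deque

THRESHOLD = 3  # "substantial difference" chosen in the text


def schedule(own_queue, own_host, view):
    """Decide whether to move jobs away from `own_host`.

    own_queue: deque of jobs, front of the queue on the left.
    view: dict host id -> estimated queue length (own first column).
    Returns (target_host, moved_jobs); (None, []) if no move is made.
    """
    own_load = len(own_queue)
    others = {h: load for h, load in view.items() if h != own_host}
    target = min(others, key=others.get)  # least busy host we know of
    difference = own_load - others[target]
    if difference < THRESHOLD:
        return None, []
    # ASSUMPTION: move half the difference (a placeholder rule, not
    # the expression used in the paper).
    count = int(difference // 2)
    # Jobs are taken from the back of the queue, as in the text.
    moved = [own_queue.pop() for _ in range(count)]
    return target, moved
```

Both the threshold and the number of jobs moved are exactly the variables the paper says must be varied and evaluated, so they are kept easy to change.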


In this example, the variables that must be 
varied and evaluated are: 


o the substantial difference variable,

o using an average or should other weighting
  schemes be used,

o the period of update,

o the probability assigned to the nearest
  neighbor view,

o the state information used, and

o the number of jobs to move.


Note that during the entire development of 
the algorithm, the five dimensions of control are 
constantly kept in mind. The global environment, 
the local environment, the algorithm's logic per- 
formed by multiple entities which do not perceive 
the identical set of events in the system, the 
missing, incomplete or erroneous state of the data 
used by the algorithm, and the fact that there is 
no universal time reference are all incorporated 
into the algorithm either explicitly or implicitly. 


Finally, in addition to the variables just 
mentioned the algorithm must also be evaluated 
according to the 9 major evaluation parameters of 
the framework. Performance can be evaluated by
any of the standard techniques: analytical models,
simulations, or implementation and measurement.
This is, of course, easier said than done. In 
many cases closed form solutions are not possible 
and simulation studies will have to be used.
However, we are actively pursuing the use of deci-
sion theory under uncertainty as a mathematical
treatment of the evaluation of performance. 


The other evaluation parameters (logical 
correctness, resiliency, overhead, stability, 
fairness, extensibility, cost and difficulty of 
initialization, and understandability) will not 
be discussed in this paper. However, the reader 
might notice that the algorithm presented has many 
of the same problems as the original ARPA routing 
algorithm (e.g. ping-ponging). 


4. Conclusions 


The main ideas behind the development of this
framework are 1) to provide a structure in which
to think about, develop and analyze decentralized con-
trol algorithms, 2) to provide a convenient
mechanism for a more meaningful comparison of 
proposed algorithms (meaningful in the sense that 
the assumptions, strengths and weaknesses of the 
algorithms are addressed), and 3) to encourage 
the development of more mathematical techniques 
for the evaluation aspects of the framework. 


Currently, a major limitation of the frame- 
work is the scarcity of effective techniques for 
evaluating the parameters. A mathematical formu- 
lation of the problem is being sought. We are 
investigating the possibility of using decision 
theory under uncertainty, cooperative game theory, 
utility theory and mathematical programming. 

Other evaluation parameters like understandability 
might always be subjective but nevertheless should 
be addressed as best as possible. Finally, the 
framework described in this paper is merely the 
beginnings of a comprehensive framework.


DIRECTIONS FOR USER DEFINED COMMUNICATION
FOR DISTRIBUTED SOFTWARE 


Robert B. Kolstad & Roy H. Campbell 
Department of Computer Science 
University of Illinois 
Urbana, Illinois 61801 


Summary 


Advances in hardware technology have decreased 
costs of processors and memory, thus permitting 
collections of processors to be coupled cost- 
effectively into distributed computing systems. 
The distribution of computing resources may facili- 
tate access speed, physical control, and contention 
reduction (though load sharing may become more dif- 
ficult) [24]. The interconnection of computer 
resources to allow processes to communicate is a 
difficult task, though success has been achieved in 
closely coupled environments [21], [4], [1], [19]. 
Loosely coupled environments such as networks do
not yield elegant solutions so quickly. Much
research proceeds on low level network protocols
and a few researchers have enhanced the users’
ability to communicate between processes in a
nonhomogeneous environment [14], [26], [24], [15], 
[13], [22]. 


Our criteria for implementing software for con- 
nected systems are evolving with gains of knowledge 
about and experience with these systems. We 
believe that a modular specification technique 
which allows separation and static description of 
synchronization, concurrency, and data access is 
necessary and that it is important to have the 
ability to develop software for connected systems 
(utilizing user specified communication) in a uni-
form, top-down manner. The ability to specify and
implement true concurrency along with freedom of 
concern (during specification, design, and coding) 
of actual physical embodiment of a process’s execu- 
tion are fundamental. Desirable solutions meeting 
these criteria will include few extensions to 
current thinking and hide implementation details 
from the programmer. Certain qualities enhance 
general programming languages: conciseness, strong 
typing and user specifiable types, a direct rela- 
tion between an algorithm’s complexity and the com- 
plexity of its representation, separation and 
orthogonality of available language constructs 
(constructs should not overlap in function), and 
the ability to make static declarations (instead of 
data-dependent ones). These properties enhance 
modifiability, reliability, portability, readabil- 
ity, maintainability, and promote higher produc-
tivity. Unfortunately, few high level languages
offer facilities to exploit abstraction of syn- 
chronization, concurrency, and communication. 


Other researchers have proposed different 
methods for interprocess and interprocessor commun- 
ication. Explicit send/receive of messages (some- 
times with automatic message encoding) has been 
proposed by [20], [15], [13], [16], [19], [18], 
[24], [28], and [25]. Signal/wait is similar to 
send/receive and has been proposed by [2], [3], and 
[16]. Many schemes rely on shared memory [4], [5], 
[2], [3], [22]. Those schemes which allow communi- 
cation in a loosely coupled environment usually 
require fixed configurations [4], [5], [15], and
[10]. All schemes so far (except [16]) use dynamic
synchronization methods or simple mutual exclusion. 
The schemes of [10] and [16] provide facilities 
similar to our proposal below but at the expense of 
extending language syntax. 


Path Pascal was developed by augmenting Pascal 
[17] with a small number of orthogonal constructs 
to specify concurrency, encapsulation, and syn- 
chronization. The Path Pascal object encapsulates 
a set of data, a set of services to operate on the 
data, initialization for the data, and a specifica- 
tion of the synchronization for the services. Path 
Pascal [8], [21], [6], [7] contains path expres- 
sions [9] for synchronization and the process 
declaration for concurrency. Although Path Pascal 
has only been used in a closely coupled environment 
(with shared memory), we believe its object con- 
struct models the desired behavior of a (possibly 
remote) service. 


Path Pascal objects are normally used in a mul- 
tiprocessor environment with shared memory (or a 
multiplexed uniprocessor environment). We propose 
that their implementation be extended to encompass 
not only tightly coupled systems but also loosely 
coupled ones. In this new methodology, objects 
exist on any of the connected processors and com-
munication between them is restricted to the invo-
cation of objects’ operations and return of var
parameters. This invocation represents a transfer
of control from the invoking process to the (possi- 
bly remote) object. Flow of control is always 
explicitly and deterministically controlled by the 
user: invocations of processes create new flows; 
terminations of processes destroy old ones. This 
control methodology resembles that of [12],
strongly resembles [10], and is different from
[11] which is the basis of several previous
schemes.


This networking of objects is achieved by com-
piling invocations of "foreign" objects to calls on
special communication routines which encode the
parameters of the invocation (including references
to other objects) [29], transmit them to the
foreign object’s host, await their return, decode
return arguments, and return control to the invok- 
ing process. This represents only a change in the 
scoping and compilation of objects and is a dual of 
message passing systems [23]. 
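
The compilation of a foreign invocation into encode, transmit, await, and decode steps can be illustrated with a small sketch. It is written in Python rather than Path Pascal, and every name in it (`Counter`, `Host`, `ForeignStub`, the JSON message layout) is hypothetical; the "network" is simulated by a direct method call, so this shows only the shape of the idea, not the authors' implementation.

```python
import json


class Counter:
    """Stands in for an object that lives on a remote host."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount
        return self.value


class Host:
    """Hypothetical remote host: decodes a request, invokes the
    named operation, and encodes the reply."""

    def __init__(self):
        self.objects = {}

    def handle(self, message):
        request = json.loads(message)
        target = self.objects[request["object"]]
        operation = getattr(target, request["operation"])
        return json.dumps({"result": operation(*request["args"])})


class ForeignStub:
    """What a compiler might emit in place of a direct invocation."""

    def __init__(self, host, name):
        self._host = host
        self._name = name

    def invoke(self, operation, *args):
        # Encode the parameters of the invocation.
        message = json.dumps(
            {"object": self._name, "operation": operation, "args": list(args)}
        )
        # Transmit and await the return (simulated by a local call).
        reply = self._host.handle(message)
        # Decode the return arguments and hand control back.
        return json.loads(reply)["result"]


host = Host()
host.objects["counter"] = Counter()
stub = ForeignStub(host, "counter")
total = stub.invoke("add", 5)
```

The invoking code sees only `stub.invoke(...)`, which is the sense in which the scheme changes scoping and compilation rather than language syntax.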


The advantages of objects hold: encapsulation
exists for each object (each object is represented
on one machine), synchronization specifications
are maintained, and data can be manipulated by all
processes (local or remote) possessing the
capability [12] for invoking the data’s object’s
operations. Network objects extend the convenient
manipulation of shared data to loosely coupled
systems and require no changes or extensions to
Path Pascal’s syntax (though separate compilation
becomes desirable). The link editor deduces the
location of objects and imposes performance
penalties only on true foreign object references.


Instantiation, naming, and compiling of new 
objects on different hosts requires special care 
[27]. The communications template of an object 
will be portable and is distributed to remote pro- 
cessors which require communication facilities. 
Binding of Path Pascal processes and objects to 
individual processors can be performed any time 
before execution of the process or object begins.


ABSTRACT

The use of the SIMD (single instruction stream -
multiple data stream) mode of parallelism to
perform the speech analysis task of linear
predictive coding is explored. Linear prediction
represents one of the major analysis techniques
for speech compression, transmission, and
recognition applications. Parallel algorithms to
perform linear prediction have been developed, and
are evaluated in terms of the number of arithmetic
operations and interprocessor data transfers
needed. From the algorithms, architectural
requirements such as machine size and
interconnection network capability are analyzed.


I. INTRODUCTION 

Because of the complexities involved in a general
purpose parallel system, it is becoming apparent
that one practical way to harness the power
of large-scale parallel processing may be to
consider its use as applied to a specific type or
class of tasks. One area which appears to be well
suited for such consideration is speech
processing. Speech analysis, performed for either
data compression or speech recognition purposes,
involves substantial computation on vectors and
arrays. SIMD (single instruction stream - multiple
data stream [6]) parallelism may therefore be
applicable to a number of speech processing tasks.
Studies of how SIMD machines can be used to
perform fast Fourier transforms [25] and pitch
detection [1] have confirmed that speech processing
operations can benefit from the SIMD mode of
parallel processing.

In this paper, the use of SIMD machines to
perform the speech analysis operation of linear
predictive coding [2,11,12,16] is explored.
Linear prediction is closely related to techniques
in time series analysis [4], Kalman filtering [7],
and Wiener filtering [28]. It is one of the
principal analysis methods used for speech
compression, transmission, and recognition
[2,16,23], and is applicable to problems in
neurophysics and seismic signal processing [11].

A general model of an SIMD machine is assumed
for the development and analysis of parallel
linear prediction algorithms. The SIMD machine
model consists of a control unit, a set of
$N = 2^n$ processing elements (PEs), each a
processor with its own memory, and an
interconnection network [19]. The control unit
broadcasts instructions to all PEs, and each
active PE executes each instruction on the data in
its own memory. The instruction is executed
simultaneously in all active PEs.


This material is based upon work supported by the 
National Science Foundation under Grant 
ECS-7909016.

The interconnection network enables data to be
transferred among the PEs. Each transfer is
expressed in terms of an interconnection function,
where interconnection function f is a bijection on
the set of PEs which transfers a data item from PE
i to PE f(i). The transfer occurs simultaneously
for all i for which PE i is active [17].
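
As an illustration of this transfer model, the following Python sketch applies an interconnection function to a list holding one data item per PE. The function name `transfer` and the list representation are hypothetical. It is also an assumption of this sketch, not a statement from the paper, that a destination PE keeps its old value when the PE that would send to it is inactive.

```python
def transfer(data, f, active):
    """Move data[i] from PE i to PE f(i) for every active PE i.

    data:   list with one item per PE.
    f:      interconnection function, a bijection on range(len(data)).
    active: list of booleans, one per PE.
    All reads use the original `data`, which models a simultaneous
    transfer.
    """
    result = list(data)  # ASSUMPTION: untouched destinations keep old data
    for i, value in enumerate(data):
        if active[i]:
            result[f(i)] = value
    return result


# Uniform shift by +1 on N = 8 PEs, all PEs active.
N = 8
shifted = transfer(list(range(N)), lambda i: (i + 1) % N, [True] * N)
# shifted == [7, 0, 1, 2, 3, 4, 5, 6]
```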

Detailed SIMD algorithms to perform linear
prediction analysis are given in [26]. In these
algorithms, some known SIMD programming techniques
have been extended [8,14] and new methods are
introduced. In this paper, the relative
complexities of corresponding serial and parallel
algorithms are reported, and the machine size and
interconnection network requirements of the
algorithms are presented. The network requirements
are expressed in terms of the interconnection
functions which are executed. The ability of
interconnection networks in the literature to
perform the required transfers is discussed.


II. LINEAR PREDICTIVE CODING 
Speech production is commonly modeled as a
filter driven by an excitation component. The
filter represents the configuration of the vocal
tract, i.e., the positioning of the mouth, nose,
and throat. The excitation represents the air
flow from the lungs which has been either
transformed into a periodic sequence of pulses by
the vocal cords (the pitch in the production of
pitched sounds), or set into rapid, "noise-like"
motion by being forced past some constriction
(e.g., the teeth against the lower lip in the
production of an "f").

Linear predictive coding (LPC) analysis
operates on a sampled signal {s}, where, if m is
an integer variable, s(m) represents the m-th
sampled value of a continuous-time speech signal.
In the linear prediction model, it is assumed that
each sample s(m) of the signal {s} can be
expressed as the sum of two components, one a
weighted sum of the previous p samples, and the
other a residual component $\delta(m)$ which may
differ for each s(m) [2,11,16]:

$$s(m) = \sum_{k=1}^{p} a(k)\,s(m-k) + \delta(m).$$


The weighted sum portion can be interpreted as the
"predicted" value $\hat{s}(m)$ for speech sample
s(m). If it is assumed that each s(m) can be
approximated by $\hat{s}(m)$, then the predictor
coefficients (a(k)'s) can be obtained by
minimizing the total squared prediction error,
defined as

$$E^2 = \sum_{m} \left[ s(m) - \hat{s}(m) \right]^2 = \sum_{m} \left[ s(m) - \sum_{k=1}^{p} a(k)\,s(m-k) \right]^2. \qquad (1)$$

The minimization of $E^2$ is performed by solving
the set of equations

$$\frac{\partial E^2}{\partial a(k)} = 0, \qquad 1 \le k \le p. \qquad (2)$$

By choosing the interval over which the linear
prediction analysis is performed to correspond to
an interval over which physiology precludes a
significant change in the vocal tract
configuration, the linear predictor will
accurately model the vocal tract, but will not
accurately model the excitation. For this reason,
the linear prediction coefficients will describe
the components of the speech due to the slowly
changing, "predictable" configuration of the vocal
tract, while the error between s(m) and
$\hat{s}(m)$ will be primarily due to the less
regular excitation component. Linear prediction is
therefore used in speech analysis to obtain
characterizations of the vocal tract and
excitation components of speech.
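
To make the error measure in (1) concrete, the following Python sketch evaluates the total squared prediction error for a given set of coefficients. The function name `prediction_error` is hypothetical, and summing over $p \le m < M$ (so that every s(m-k) is an available sample) is an assumption of this sketch; the text leaves the range of m open at this point.

```python
def prediction_error(s, a):
    """Total squared prediction error E^2 of equation (1).

    s: list of samples s(0), ..., s(M-1).
    a: list of predictor coefficients, a[k-1] holding a(k), k = 1..p.
    ASSUMPTION: m ranges over p <= m < M.
    """
    p = len(a)
    total = 0.0
    for m in range(p, len(s)):
        # Predicted value: weighted sum of the previous p samples.
        predicted = sum(a[k - 1] * s[m - k] for k in range(1, p + 1))
        total += (s[m] - predicted) ** 2
    return total


# A signal that doubles every sample is predicted exactly by a(1) = 2.
print(prediction_error([1, 2, 4, 8, 16], [2.0]))  # prints 0.0
```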

The number of samples s(m) used in obtaining a
set of predictor coefficients will typically be
between 100 and 400, corresponding to 10-20
milliseconds of speech, depending on the rate at
which the original speech signal was sampled. The
linear prediction analysis will therefore be
performed between 50 and 100 times for a second of
speech. Typical values of p, the number of terms
used in the approximation of s(m), will be between
6 and 25 [12].

Different assumptions about the range of m in
(1) yield different formulations of linear
prediction, on which different techniques to solve
the system of equations in (2) can be used. The
assumption that s(m) is 0 outside the interval
$0 \le m < M$, for some M, results in the
autocorrelation method [12,16]. This method
possesses some desirable computational properties,
but the assumption that s(m) = 0 outside the given
interval is not, in general, true. The assumption
that m is to range over a fixed interval
$p \le m < M$, but that the signal may be non-zero
outside that interval (in particular, that s(l),
$0 \le l < p$, need not be zero) results in the
covariance method [2,12,16]. This is a more
accurate model of speech, but the solution of the
equations in (2) is more expensive than with the
autocorrelation method. Both methods are widely
used.


III. SIMD ALGORITHM ATTRIBUTES

Under the assumptions of the autocorrelation
method, the predictor coefficients (a(k)'s) can
be obtained by solving the system of equations
[11,16]:

$$\sum_{k=1}^{p} a(k)\,R(|i-k|) = R(i), \qquad 1 \le i \le p \qquad (3)$$

where the R(i)'s are the short-time
autocorrelation functions:

$$R(i) = \sum_{m=0}^{M-1-i} s(m)\,s(m+i), \qquad 0 \le i \le p.$$

Equivalently, the predictor coefficients can be
found by solving the matrix equation:

$$\mathbf{R}\,\mathbf{a} = \mathbf{r} \qquad (4)$$

where $\mathbf{r}$ and $\mathbf{a}$ are the
p-element column vectors of elements R(i) and a(i)
respectively for $1 \le i \le p$, and $\mathbf{R}$
is the p by p matrix in which
$\mathbf{R}(i,k) = R(|i-k|)$, $1 \le i,k \le p$.
$\mathbf{R}$ is a Toeplitz matrix, i.e., it is
symmetric, with all elements in each diagonal
being identical.

Obtaining the predictor coefficients using the
autocorrelation method consists of two steps:
computation of the R(i)'s and solution of the system
of equations in (3). SIMD algorithms to compute
the R(i)'s in $N \ge M$ PEs were given in [24].
Durbin's method [16] is an iterative serial
technique for solving a system of equations involving
a Toeplitz matrix. A parallel algorithm based on
Durbin's method for computing the a(k)'s given the
R(i)'s is presented in [26]. The relative
complexities of the serial and parallel algorithms
are shown in Table 1.
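
As an illustration of the two serial steps just described, the following Python sketch computes the short-time autocorrelations and then applies the usual Durbin recursion to the system in (3). It is a minimal serial sketch only, not the SIMD algorithms of [24] or [26]; the function names `autocorrelation` and `durbin` and the sample values are hypothetical.

```python
def autocorrelation(s, p):
    """Short-time autocorrelations R(0..p) of the samples s(0..M-1)."""
    M = len(s)
    return [sum(s[m] * s[m + i] for m in range(M - i)) for i in range(p + 1)]


def durbin(R, p):
    """Solve sum_k a(k) R(|i-k|) = R(i), 1 <= i <= p, by Durbin's recursion."""
    a = [0.0] * (p + 1)      # a[1..p] hold the predictor coefficients
    E = R[0]                 # prediction error energy, assumed non-zero
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / E
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E *= (1.0 - k * k)
    return a[1:]


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 2.0, 1.0, 0.5]   # made-up speech segment
    R = autocorrelation(samples, 2)
    print(durbin(R, 2))
```

The recursion needs on the order of p^2 operations, because it exploits the Toeplitz structure instead of performing a general elimination.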

The two algorithms to compute the R(i)'s use
$N \ge M$ PEs. The interconnection functions required
by the first algorithm are the Shift$_{-1}$ function
and the Cube interconnection functions. The
Shift$_{-1}$ is one of the uniform shift functions,
where in general the Shift$_{+d}$ function is defined
as the interconnection function which transfers
data from PE i to PE (i+d) mod N, $0 \le i < N$. The
Cube functions [17] consist of n interconnection
functions, defined for $0 \le i < n$ as

\[ \mathrm{Cube}_i(p_{n-1} \ldots p_i \ldots p_0) = p_{n-1} \ldots \bar{p}_i \ldots p_0 \]

where $p_{n-1} \ldots p_0$ is the binary representation of a
PE address and $\bar{p}_i$ denotes the complement of $p_i$. The second
R(i) algorithm employs the n Cube functions and a
Broadcast, defined to be the transfer of a data
item from one PE to all PEs. The parallel algorithm
based on Durbin's method uses $N \ge p$ PEs and
requires the $\log_2 p$ Cube functions, the Shift$_{+d}$
functions for $1 \le d \le p/2$, and p/2 Exch$_d$ functions,
defined for $p/2 \le d < p$ as

\[ \mathrm{Exch}_d(j) = \begin{cases} j+d, & 0 \le j < p-d \\ j, & p-d \le j < d \\ j-d, & d \le j < p \end{cases} \]

In Section IV, the ability of various interconnection
networks to execute these interconnection
functions is discussed.
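
The interconnection functions named above can be written as address maps on PE numbers. The sketch below is one reading of the definitions, with the Exch ranges taken as 0 <= j < p-d, p-d <= j < d, and d <= j < p; the Python function names are hypothetical and no real network is modelled.

```python
def shift(d, N):
    """Uniform shift: PE i sends its data to PE (i + d) mod N."""
    return lambda i: (i + d) % N


def cube(b):
    """Cube_b: complement bit b of the binary PE address."""
    return lambda i: i ^ (1 << b)


def exch(d, p):
    """Exch_d on addresses 0..p-1, assuming p/2 <= d < p."""
    def f(j):
        if j < p - d:
            return j + d      # low block moves up by d
        if j < d:
            return j          # middle block stays in place
        return j - d          # high block moves down by d
    return f


if __name__ == "__main__":
    print([shift(-1, 8)(i) for i in range(8)])
    print([cube(1)(i) for i in range(8)])
    print([exch(5, 8)(j) for j in range(8)])
```

Under this reading each Exch map is a permutation that swaps the lowest p-d addresses with the highest p-d addresses and leaves the middle ones fixed.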

Under the assumptions of the covariance method,
the predictor coefficients can be obtained by
solving the system of equations [2,12]:

\[ \sum_{k=1}^{p} a(k)\, c(k,i) = -c(0,i), \qquad 1 \le i \le p \tag{5} \]

The c(k,i)'s represent a covariance matrix

\[ c(i,j) = \sum_{m=p}^{M-1} s(m-i)\, s(m-j), \qquad 0 \le i,j \le p \tag{6} \]

where samples s(0) through s(M-1) are available in
the speech segment. Solution of the system of
equations in (5) is equivalent to solving the
matrix equation:

\[ \mathbf{C}\, \underline{a} = -\underline{C} \tag{7} \]

where $\underline{C}$ is the p-element column vector of elements
c(0,i), $1 \le i \le p$, and $\mathbf{C}$ is the p by p covariance
matrix in which $\mathbf{C}(k,i) = c(k,i)$, $1 \le i,k \le p$.
Solution for the a(k)'s in the covariance method
consists of two steps: computation of the c(i,j)'s
and solution of the system of equations in (5) or
(7). Parallel algorithms to compute the c(i,j)'s,
perform the matrix inversion, and compute the
matrix-vector product needed to solve for the
a(k)'s in equation (7) are given in [26].
Complexities of the serial and SIMD algorithms are
shown in Table 1.

The algorithm to compute the c(i,j)'s uses
$N \ge M$ PEs. The interconnection functions required
are the Shift$_{-1}$ and Shift$_{-(M-1)}$ functions, the
Shift$_{+i}$ functions for $1 \le i \le p$, and the set of n
"Minus2$^i$" functions. The Minus2$^i$ functions are
defined for $0 \le i < n$ as

\[ \mathrm{Minus2}^i(j) = (j - 2^i) \bmod N. \]

The matrix inversion algorithm uses $N \ge p$ PEs, and
requires a Broadcast, the Shift$_{+d}$ functions for
$1 \le d \le p$, and the set of $\log_2 p$ "Plus2$^i$" functions
defined for $0 \le i < n$ as

\[ \mathrm{Plus2}^i(j) = (j + 2^i) \bmod N. \]

The algorithm to compute the matrix-vector product
uses $N \ge p$ PEs, and requires a Broadcast.

IV. MACHINE REQUIREMENTS
From the algorithms developed, it is possible
to infer some characteristics of an SIMD machine
designed to perform LPC analysis efficiently. The
machine should have at least M PEs, needed for the
fast computation of autocorrelation or covariance
coefficients. A submachine of size p will be used
to solve for the predictor coefficients. For the
autocorrelation method, the Cube, Shift, and Exch
functions must be performed. A Broadcast may also
be needed. For the covariance method, the Plus2$^i$,
Minus2$^i$, and Shift functions and a Broadcast are
used. It can be shown [26] that each of these
functions and the Broadcast can be performed in a
single pass through a number of multistage
networks, including the data manipulator [5],
augmented data manipulator [18], generalized cube
(with four-function interchange boxes) [22], and
omega [9] networks. In one pass, the indirect
binary n-cube [15] can perform each of the
functions; the effect of a Broadcast can be achieved
in at most n passes, using a transfer pattern
similar to that of recursive doubling. The STARAN
flip network [3] can perform the Cube, Plus2$^i$, and
Minus2$^i$ functions in a single pass, the Shift
functions and Broadcast in at most n passes, and
the Exch in at most 2n passes. A single-stage
shuffle-exchange network [27] can perform each of
these data transfers in at most n shuffles and n
exchanges. More details are in [26].

V. CONCLUSIONS
Many large-scale multimicroprocessor systems
which can operate in the SIMD mode of parallelism
have been proposed [e.g., 10, 13, 15, 20, 21].
This paper explores the use of SIMD parallelism


Table 1. Summary of serial and SIMD algorithm complexities for linear prediction computations.

                                   multiplications      additions                 inter-PE transfers             types of transfers
AUTOCORRELATION METHOD:
R(i)'s          serial             M(p+1) - p(p+1)/2    M(p+1) - p(p+1)/2         -                              -
                SIMD - M PEs*      p+1                  (p+1) log2 M              (p+1) log2 M + p               Shift_{-1}, Cube
             or SIMD - M PEs*      log2 M + 1 †         2(log2 M + 1) †           log2 M                         Cube, Broadcast
a(k)'s          serial             p^2 + 2p             p^2 + p                   -                              -
                SIMD - p PEs       5p - 2               p log2 p + 3p + log2 p    p log2 p + p/2 + log2 p - 1    Cube, Shift, Exch
COVARIANCE METHOD:
c(i,j)'s        serial             Mp + p^2 - p         Mp + p^2 - p              -                              -
                SIMD - M PEs       p+1                  (p+1)(log2 M + 1)         (p+1) log2 M + 3p + 1          Shift, Minus2^i
matrix
inversion       serial             p^3 + p^2 - 1        p^3 - 2p^2 + p            -                              -
                SIMD - p PEs       2p^2                 2p^2 - p                  3p^2 - 2p + 2 log2 p + 1       Broadcast, Shift, Plus2^i
matrix-vector
product         serial             p^2                  p^2                       -                              -
                SIMD - p PEs       p                    p                         p                              Broadcast

* presented in [24]     † complex arithmetic
for the applications area of speech processing by
discussing parallel algorithms to perform linear
predictive coding analysis. From the algorithms,
design criteria for an SIMD machine for speech
analysis applications can be derived.

The approach taken to studying the applicability
of SIMD machines to linear predictive coding
has been to develop and analyze parallel linear
prediction algorithms. On one hand, these
analyses provide direct information towards
evaluating the usefulness of parallel computers for
speech processing and related areas. At the same
time, however, they contribute to the more general
body of knowledge concerning parallel processing.
By developing algorithms for a general model of a
parallel system, insight can be gained into a
number of aspects of parallel processing. The
algorithms can be used to define specific
architectural features, such as the number of processors
needed/useful for a class of problems, the sizes
of memories required, interconnection network
capabilities needed, and the type of processing
capability required in each processor. Thus,
applications studies such as this provide
information for both the speech processing and the
parallel processing researcher.
Abstract -- A pipelined version of the
parallel Givens Reduction algorithm of Sameh and
Kuck is developed that runs on a quad-connected
p by p multiprocessor array. With the addition of a
shuffle transformation, this permits the solution
of the n by n system $Ax = f_i$, $i = 1,\ldots,r$ in time
$C(n/p)^2(n+p+r)$. When A has band width p, the
time is $C(n/p)(p+r)$. This is used as the kernel
for a pipelined nested dissection solver for the
$n^2$ by $n^2$ algebraic systems that arise in finite
element problems on n by n grids. With an n by n
mesh-connected multiprocessor, the method runs in
time $C(n + r\log(n))$.


INTRODUCTION 
The method of Nested Dissection by George
[1] has been shown by Liu [2] to be capable of
solving the sparse $n^2$ by $n^2$ system of equations Ax=f
on an n by n finite element grid in time O(n)
with $O(n^2)$ processors. Like many of the now
classical parallel algorithms, this theoretical
result did not consider the problem of
interprocessor communication. Kung and Leiserson [3]
have shown that, by a "wavefront" or "systolic"
process, the solution of a dense n by n system of
linear equations can be pipelined on a
hex-connected array of $O(n^2)$ processors which carry
out standard Gaussian elimination. The result is
an explicit scheme of computation that includes
interprocessor communication within the standard
O(n) time bound. More recently, Kuhn [4] has
demonstrated a class of program transformations
that map a great number of algorithms onto
specific architectures with the property that the
resulting execution time is of the same
complexity as the classical asymptotic time bound.

Among the more stable dense system solvers
is the method of Parallel Givens rotations by
Sameh and Kuck [5]. As has been observed by Kung
and Leiserson, this method can easily be
adapted to this type of pipelined array
environment. The purpose of this note is to develop one
such implementation of the Sameh-Kuck algorithm
that runs on a quad-connected multiprocessor
array, and to use this as the kernel for a nested
dissection solver that can be pipelined on a
mesh-connected multiprocessor such as the
N.A.S.A. Langley Finite Element Machine [6].

The following paragraphs will first treat
the Sameh-Kuck algorithm as well as several other
useful operations that can be executed on a
square array mesh-connected multiprocessor.
These operations will constitute a set of
primitives from which the nested dissection solver
will be constructed in the third section. The
last section will discuss the significance as
well as the shortcomings of such an approach.


The Givens Rotation 
Consider the problem of finding the
solutions to the system $Ax = f_i$, for load vectors $f_i$,
$i = 1,\ldots,r$. The approach taken by Sameh and Kuck
is to reduce the matrix A to upper triangular
form by the application of a carefully chosen
sequence of rotations. It is shown in [5] that
this sequence of rotations can be blocked in such
a way that all rotations within a given block may
be executed in parallel. Once A is in upper
triangular form, an algorithm such as the column
sweep method described in Kuck [7] can be used to
complete the back solve process to obtain the
solution vectors x.

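The basic Givens rotation of two rows i,j of
the matrix (A, f) is given by

\[ \begin{pmatrix} a'_{i1} & \cdots & a'_{in} & f'_{i1} & \cdots & f'_{ir} \\ 0 & \cdots & a'_{jn} & f'_{j1} & \cdots & f'_{jr} \end{pmatrix} = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \begin{pmatrix} a_{i1} & \cdots & a_{in} & f_{i1} & \cdots & f_{ir} \\ a_{j1} & \cdots & a_{jn} & f_{j1} & \cdots & f_{jr} \end{pmatrix} \]

where primes mark the rotated entries.
Assume that the i,j rows of the (A,f) array are
made available to a pair of connected processors
every C time steps. After receiving the first
column, the processor pair must compute the
rotation coefficients given by

\[ d = \sqrt{a_{i1}^2 + a_{j1}^2}, \qquad c = \frac{a_{i1}}{d}, \qquad s = \frac{a_{j1}}{d} \]

and then output $(d, 0)^T$. Assume the pair of
processors can communicate one item of data in
either direction in one unit of time. The
algorithm for the pair of processors consists of the
following sequences: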


In processor P1            In processor P2

input a_i1                 input a_j1
d := a_i1^2                d := a_j1^2
send d to P2               recv e from P1
recv e from P2             send d to P1
d := d + e                 d := d + e
d := sqrt(d)               d := sqrt(d)
c := a_i1/d                s := a_j1/d
send c to P2               recv c from P1
recv s from P2             send s to P1
output d                   output 0


At succeeding blocks of C time steps
the pair computes and outputs the rotated entries

\[ a'_{ik} = c\, a_{ik} + s\, a_{jk}, \qquad k = 2,\ldots,n, \]
\[ f'_{ik} = c\, f_{ik} + s\, f_{jk}, \qquad k = 1,\ldots,r \]

in processor 1, and

\[ a'_{jk} = -s\, a_{ik} + c\, a_{jk}, \qquad k = 2,\ldots,n, \]
\[ f'_{jk} = -s\, f_{ik} + c\, f_{jk}, \qquad k = 1,\ldots,r \]

in processor 2. From the first stage of
computation, it can be seen that C = 9 + one square
root. At the expense of a more complex
instruction sequence for each processor, the square root
free method of Hammarling [8] can be implemented.
(In this case, in order to prevent underflow the
processor pair must make a test, and possibly
switch the computational roles of the two
processors.)
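
The arithmetic carried out by the processor pair can be summarized in a short serial sketch. The Python below assumes the standard form of the rotation, d = sqrt(a_i1^2 + a_j1^2), c = a_i1/d, s = a_j1/d; the function name `givens_rotate` is hypothetical, and the message passing between P1 and P2 is not modelled.

```python
import math


def givens_rotate(row_i, row_j):
    """Rotate rows i and j of (A, f) so the leading entry of row j becomes 0."""
    d = math.hypot(row_i[0], row_j[0])
    if d == 0.0:
        # Nothing to eliminate; return the rows unchanged.
        return list(row_i), list(row_j)
    c, s = row_i[0] / d, row_j[0] / d
    # Processor 1 forms c*a_ik + s*a_jk, processor 2 forms -s*a_ik + c*a_jk.
    new_i = [c * x + s * y for x, y in zip(row_i, row_j)]
    new_j = [-s * x + c * y for x, y in zip(row_i, row_j)]
    new_j[0] = 0.0            # exact zero in place of rounding residue
    return new_i, new_j


if __name__ == "__main__":
    print(givens_rotate([3.0, 1.0, 2.0], [4.0, 5.0, 6.0]))
```

Because the rotation is orthogonal, the sum of squares in each column of the two rows is unchanged, which is the source of the method's numerical stability.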


Figure 1 illustrates the overall structure
for the pipelined algorithm. It is assumed the
matrix and right hand side vectors (A, F) are
stored by rows in a column of memory units along
the right hand side of the array of processors.
One column is accessed every C time units. The
data marches, column by column, through the array
of processors. The pattern of zeros introduced
into the matrix is given by the numbered
elements. When matrix element x reaches processor x
the reduction sequence is started. The processor
x and the processor marked by * above x carry out
the programs described above for P1 and P2
respectively. The unmarked processors simply
execute a receive, wait, and transmit operation on
all data they receive. In order to keep the rows
synchronized as they move through the array, the
wait cycle should be equal to C. As in figure 1,
an n by n matrix is reduced to upper triangular
form after being passed through a triangle
embedded in a rectangular array of 2n-3 by n quad
connected processors.

Figure 1. Givens Reduction Array.

The total time to perform the reduction on
all n columns of A plus the r columns $f_i$,
$i = 1,\ldots,r$ is C(3n+r-3). The reduction operation is
equivalent to a left multiply by an orthogonal
matrix Q. If we let the upper triangular matrix
be represented by U = QA, and let $f'_i = Q f_i$,
$i = 1,\ldots,r$, the problem has been reduced to the
solution of the system

\[ U x_i = f'_i, \qquad i = 1,\ldots,r. \]
A second pass through the processor array can be
used to execute a column sweep back solve
algorithm as illustrated in figure 2. The initial
storage scheme is identical to the reduction
sweep. First the entries of U are moved into the
array with a simple pipelined broadcast along the
rows of the array. In the computational phase of
the algorithm the rows of the $f_i$ vectors are
advanced into the computational array. The
movement is not by columns but is staggered so that
the first entry of the last row is entered first,
then the first entry of row n-1 and the second
entry of row n. In figure 2, the numbers beside
the arrows indicate the order of the data
movement for the first column as it passes through
the array. The activities of the processors fall
into three categories. The processors containing
matrix diagonal elements receive a data item from
the right and divide by the diagonal element.
The result is an x value and is transmitted both
upward and to the left. The processors above the
diagonal receive an x value from below which they
multiply by the contained matrix entry, and
subtract from the value received from the right.
The difference is passed to the left and the x
value is sent on upward. Processors below the
diagonal receive x values which they transmit to
the left. If the vectors $f_i$, $i = 1,\ldots,r$ are viewed
as an n by r matrix F, this process computes $U^{-1}F$
in time C(3n+r). In this case C, the maximum
time for any processor to complete one cycle of
its task, is about 4 instructions.
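
For reference, the arithmetic of the column sweep can be written serially for a single right hand side. The sketch below is a plain column-oriented back substitution, not the pipelined array version; the function name `column_sweep_back_solve` is hypothetical.

```python
def column_sweep_back_solve(U, f):
    """Solve U x = f for upper triangular U, sweeping columns right to left."""
    n = len(U)
    b = list(f)
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        # Diagonal processor: divide by the diagonal element to get x_j.
        x[j] = b[j] / U[j][j]
        # Processors above the diagonal: multiply x_j by the stored entry
        # and subtract it from the running right hand side.
        for i in range(j):
            b[i] -= U[i][j] * x[j]
    return x


if __name__ == "__main__":
    U = [[2.0, 1.0, 1.0], [0.0, 3.0, 2.0], [0.0, 0.0, 4.0]]
    print(column_sweep_back_solve(U, [7.0, 12.0, 12.0]))
```

Each x value is used by every row above it as soon as it is produced, which is what allows the x values to be streamed upward through the array.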
Figure 2. Back solve.


In a similar manner, one can compute the
matrix product AB for an n by n matrix A, and an n
by r matrix B. This is illustrated in figure 3.
In this case, matrix A is initially stored along
the top row and the matrix B is stored along a
perpendicular edge. During the first phase the
matrix A is moved into the array. In the second
phase, the matrix B is piped through the array.
The result of the inner product is accumulated as
the data moves vertically, and the values of B
are piped horizontally. The result of the
product appears along the top row. It should be
noted that the above two procedures are simple
adaptations of ideas that are not at all new and
fall under the general heading of "column sweep"
or "wave front" algorithms.

For the purpose of a concise description of
the dissection algorithm, we shall adopt the
following notation. Given an n by n array of
processors connected vertically, horizontally, and
diagonally to their neighbors, and an n by n
matrix A and an n by r matrix B, define

GR(A; B) = the Givens reduction process
           of forming (U, QB).
BS(U; B) = the back solve process
           to form $U^{-1}B$.
MP(A; B) = the matrix multiplication process.

The notation (A, B) will describe the n by n+r
matrix formed by the concatenation of the
matrices A and B. If P is a row or column of
processors the notation $A_P$ shall mean that A is
stored in the memories of P one row per
processor. The above processes all share the property
that one starts with data along one edge of the
processor array and the results are produced
along some other edge. The dissection algorithm
shall have occasion to require the results of an
operation to lie in the same processor memory set
as those in which the data originated. For this
reason define the process

$A_R = MV(A_P)$ = the pipelined copy
           of A from processor set P to
           processor set R.


Note that the Givens reduction, as defined
above, does not exactly correspond to the
algorithm described previously. The difference lies
in the size of the processor array required. The
basic reduction process requires a 2n by n array.
Rather than treat the problem of reducing the
requirement to an n by n array, consider the more
general limited processor problem of reducing an
n by n matrix with a p by p processor array with
p < n. For simplicity, assume that p divides n.
The more general case is not difficult to
develop. The process of reducing a p by n matrix
to upper triangular form by a height p processor
array will be called p-reduction. Observe that
by using $p^2$ processors divided first into two
height p/2 triangles, one can reduce a p by n matrix
to the form illustrated in figure 4. This is
simply two reduced p/2 by n matrices stacked
vertically. The resulting process will be called
a 2(p/2)-reduction. The problem is then how to
complete a 2(p/2)-reduction to a p-reduction.
The solution lies in the use of a shuffle
transformation to rearrange the rows of the
matrix so that a second pass through the processor
array will complete the task. For p even, the
shuffle permutation is given by

\[ sh(i) = 2i, \qquad 0 \le i < p/2, \]
\[ sh(i) = 2i - p + 1, \qquad p/2 \le i < p. \]

By applying a shuffle transformation, the two
(p/2)-reduced arrays are merged to a matrix that
is "half" reduced in the sense that the existing
zeros correspond to those introduced by the right
half of the 2p by p processor triangle studied
above. Hence, the remainder of the reduction can
be completed by using the left half of the height
p processor triangle (see figure 5). Employing
the block elimination scheme illustrated in
figure 6 for the case p/2 = n/4, the n by n system
can be n-reduced in time $2C(n/p)^2(p+n+r)$ using $p^2$
processors. The algorithm first completes n/p
2(p/2)-reductions and the remainder of the blocks
are reduced by the shuffle half reduction scheme.
The order of elimination of the various blocks is
given by the numbering in figure 6. A similar
analysis gives the case of a band width p system
in time $Cn(3 + 2r/p)$.

Figure 4. The 2(p/2)-reduction.

Figure 5. Half Givens array. Half reduced array. Shuffle. 2(n/2) arrays reduced.

Figure 6. Block Reduction.
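
The effect of the shuffle on the rows of a 2(p/2)-reduced matrix can be seen with a few lines of Python. The sketch assumes the reading sh(i) = 2i for 0 <= i < p/2 and sh(i) = 2i - p + 1 for p/2 <= i < p; the helper name `shuffle_rows` is hypothetical.

```python
def sh(i, p):
    """Shuffle permutation on row indices 0..p-1, p even."""
    return 2 * i if i < p // 2 else 2 * i - p + 1


def shuffle_rows(rows):
    """Place row i at position sh(i), interleaving the two halves."""
    p = len(rows)
    out = [None] * p
    for i, row in enumerate(rows):
        out[sh(i, p)] = row
    return out


if __name__ == "__main__":
    # Rows a..d stand for the upper reduced block, e..h for the lower one.
    print(shuffle_rows(list("abcdefgh")))
```

Row k of the upper block and row k of the lower block become neighbors, so rows with the same zero pattern sit next to each other for the second pass.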
To complete the mapping of this process onto
the p by p processor array, it suffices to
observe that the shuffle permutation is easily
pipelined. This is illustrated in figure 7 for
the case of p = 8.

Figure 7. Pipelined Shuffle.


NESTED DISSECTION 


The nested dissection algorithm is simply
Gaussian elimination based on a very special
ordering of the variables in the system. Initially,
one starts with a finite difference or finite
element problem defined on an n by n grid of
nodes. The variables of the system correspond to
the solution values at the nodes of the grid.
The $n^2$ by $n^2$ matrix of coefficients in the system
of equations corresponds to the pairwise
interactions of the nodes on the n by n grid. The main
idea behind the algorithm can be described as
follows. First, from the system of equations, we
eliminate all variables corresponding to the
interior nodes of the square grid. What remains is
a much smaller system involving only the nodes on
the boundary. Once the smaller system has been
solved, the boundary values obtained are used in
a back solve process to obtain the values for the
interior node variables. The important feature
of the algorithm is the way in which we eliminate
the interior node variables. Divide the n by n
grid into 4 grids of size n/2 by n/2. Eliminate
the interiors of each of these smaller grids (in
parallel) and then eliminate the variables for
the nodes on the cross that quartered the
grid. The elimination of the interior nodes for
the subgrids is the same process, recursively
applied. By repeatedly subdividing, all interior
nodes are eventually seen to lie on some
subdividing cross. In terms of the Gaussian elimination
process, subdiagonal segments of the columns in
the matrix corresponding to the smallest
subdividers are eliminated in parallel first. Then
the subdiagonal matrix elements corresponding to
the next smallest are eliminated. The process
continues until all subdiagonal elements for all
interior nodes have been eliminated. The
paragraphs that follow will give a more rigorous
description of this process.
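
The ordering described above can be made concrete with a small recursive sketch. The Python below lists the interior nodes of a (2^s+1) by (2^s+1) grid in one valid nested dissection order: the four quadrants first, then the dividing cross. It is an assumption-laden serial illustration, with a hypothetical function name `dissection_order` and nodes given as (x, y) pairs; the four recursive calls would run in parallel in the method described here.

```python
def dissection_order(s, a=0, b=0):
    """Interior nodes of the block with lower left corner (a, b) and side
    2**s + 1, listed in nested dissection elimination order."""
    if s == 0:
        return []                       # a 2 by 2 block has no interior
    h = 2 ** s
    m = h // 2
    order = []
    # Interiors of the four quadrants (independent of one another).
    for da, db in [(0, 0), (m, 0), (0, m), (m, m)]:
        order += dissection_order(s - 1, a + da, b + db)
    # Then the cross: vertical bisector, and horizontal bisector less center.
    order += [(a + m, b + i) for i in range(1, h)]
    order += [(a + i, b + m) for i in range(1, h) if i != m]
    return order


if __name__ == "__main__":
    print(dissection_order(2))
```

For s = 2 the list starts with the four quadrant centers and ends with the five nodes of the central cross, covering all nine interior nodes of the 5 by 5 grid exactly once.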

The dissection algorithm is one of the very
large class of algorithms based on a recursive
divide-and-conquer approach. These algorithms
start with a problem of size $2^k$ and generate 2
problems of size $2^{k-1}$, then 4 problems of size
$2^{k-2}$, and so on. Previous studies of parallel
nested dissection have considered the
application of the algorithm to a machine like the
CRAY 1. The unique feature of a vector
architecture is that only a few vectors can usually be
processed in parallel, and one wants these to be
as long as possible. Unless one defines the
vectors used in the computation to be cross sections
of the components of the subproblems, the
recursive algorithms very quickly generate $2^k$ vectors
of length 2. Calahan has proposed a method [9]
based on "generalized vectors" to deal with this
problem.

A second approach is to consider termination
of the algorithm before the vectors get too short
and then switch to a second method. The
transition point depends upon various factors such as
the pipe start-up time. George, Poole, and Voigt
[10] have studied this trade off in great detail
and have developed several very attractive
algorithms.

An architecture of the MIMD class provides a
very flexible environment for the parallel
implementation of algorithms based on the recursive
divide-and-conquer technique. In its simplest
form, the N.A.S.A. Langley finite element machine
is an array of $n^2$ microprocessors that correspond
to the nodes in a finite difference grid and are
interconnected in a manner that corresponds to a
nine point finite difference operator, i.e. each
processor is connected to its eight nearest
neighbors. In the local memory of each processor
we can easily generate the rows of the stiffness
matrix A and the load vectors f that correspond
to that node of the grid. For simplicity, let
$n = 2^k + 1$ for some k > 0. Number the nodes by (x,y)
coordinate pairs on an integer lattice with (0,0)
as the lower left hand corner. Let the processor
at (x,y) be denoted by P(x,y). Let $a, b, s \ge 0$, and
consider a $(2^s+1)$ by $(2^s+1)$ block with P(a,b) as
the lower left hand corner. In order to describe
the algorithm it is helpful to define certain
sets of processors and variables associated with
such a block. Define


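\[ PX_1^s(a,b) = \{ P(a+2^{s-1},\, b+i) \mid 1 \le i \le 2^s-1 \} \]
\[ PX_2^s(a,b) = \{ P(a+i,\, b+2^{s-1}) \mid 1 \le i \le 2^s-1,\ i \ne 2^{s-1} \} \]
\[ PY_1^s(a,b) = \{ P(a+i,\, b+2^s) \mid 1 \le i \le 2^s \} \]
\[ PY_2^s(a,b) = \{ P(a+2^s,\, b+i) \mid 0 \le i \le 2^s-1 \} \]
\[ PY_3^s(a,b) = \{ P(a+i,\, b) \mid 0 \le i \le 2^s-1 \} \]
\[ PY_4^s(a,b) = \{ P(a,\, b+i) \mid 1 \le i \le 2^s \}. \]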
Let PN(a,b) be the set of processors in this
block not contained in the above sets. In short,
the set $PX_1$ is the set of processors on the
vertical bisector, $PX_2$ is the set along the
horizontal bisector less the center processor, and $PY_i$,
i=1,..,4 is the set corresponding to the four
outer edges of the block. This is illustrated in
figure 8. Let N, $X_i$, $Y_j$ with i=1,2 and j=1,..,4
be the set of indices of the corresponding
variables where the superscript s and base address
(a,b) are dropped when the context is clear. Let
Z be the set of indices corresponding to the
columns of $f_i$, i=1,..,r and define A(H;I) to be
the subblock of the matrix (A, F) for rows H and
columns I. Special blocks of interest will be

\[ B_{ij} = A(X_i; X_j), \qquad 1 \le i,j \le 2, \]
\[ C_i = A(X_i;\ Y_j,\ j=1,\ldots,4,\ Z), \qquad i = 1,2, \]
\[ D_{ij} = A(Y_i; X_j), \qquad 1 \le i \le 4,\ 1 \le j \le 2. \]

The main idea behind the implementation con- 
structed here is to map the dissection algorithm 
onto the array of processors so that when the 
grid is recursively quartered, the processors in 
each quadrant are assigned the task of completing 
the computation associated with the variables in 
that quadrant subgrid. 


Figure 8. Quartered Grid.


The only problem to be solved is how to
organize the computation of the recursive reduction
process in a manner that avoids memory and
processor contention for processors associated with
the subdividing cross. In order to visualize the
process, consider the structure of the matrix (A,
F) in a block of size $2^s+1$ by $2^s+1$ after the
elimination of all bisectors of size $2^{s-1}$ and
smaller (corresponding to the elimination of all
variables $X_i^r$ for $r \le s-1$). The structure of the
matrix is shown in figure 9.

Figure 9. (A, F) Block Decomposition, with block columns N, X_1, X_2, Y_1, Y_2, Y_3, Y_4, Z.

The algorithm must reduce the matrix $B_{ij}$ and
eliminate the subdiagonal blocks $D_{ki}$ for i,j=1,2
and k=1,..,4. The nested dissection reduction of
the interior variables of the grid can be
described in the following algorithmic form.

Procedure NDR(s,a,b)
Begin
  if s>0 pardo
    call NDR(s-1, a, b);
    call NDR(s-1, a+2^(s-1), b);
    call NDR(s-1, a, b+2^(s-1));
    call NDR(s-1, a+2^(s-1), b+2^(s-1));
  odpar;
/* move the Ba; rows to the outer 
edges of the array */ 
ae it VoyS(a,py) tna? 
M217 P227D ys cg yy zt n22r°28 


/* reduce the Bi matrix */ 


C,):=GR(B, 3B, 5.C,)3 


-=BS(U, 


(U1 >Bi a> 


(B45 6 -) $Bo ay); 
12°64 py9¢a,b) 13By 99°, 
(B-,,,C7,) : =MV(B- 

ai PY°(a,b) 


/* if X, is nonempty eliminate 
By, and reduce By 9 * / 


if X, <> 0 then 


:=MP (B “+B 


2133 1 29€ 22°°2)3 


:=GR(By5,C,)3 


fi: 


/* eliminate the D 
for i=l to 4 do 
(D,. »E, ) >=MP(D.,;B Ca) 
il *py8(a .b). il “12 1 
+(D 


x 
ij subblocks */ 


»D.)3 
od; 12° 4 
if X., <> 0 then 


for i= 1 to 4 do 
(E, ) >=MP(D.,;C~,)+£,; 
py’ (a, b) i2 2 i 
od; 
Lg a Be 
eend; 
The first block of the algorithm is the parallel
recursive call, and hence constitutes the only
possible point of processor or memory contention.
In fact, each call to NDR must be seen as an
allocation of a square block of processors to that
called process, and the blocks defined here
overlap along the common boundaries $PY_i^s$ only. The
only possible way contention can occur is if some
operation requires a pair of opposite outer
edges. In this case, the same process executing
on a neighboring block will demand access to the
processors along the common boundary, and a
conflict will develop. To see that no contention
does develop, observe that at no point during
the NDR procedure are there more than two
perpendicular sets of edge processors involved in any
one step of the process. The first set of moves
are from the interior cross to an outer edge.
The Givens reductions and the back solves start
along an outer edge, but terminate short of the
outer edge on the opposite side. The matrix
multiplications employ a pair of perpendicular
edges.

A time complexity estimate can be derived by
observing that each operation in the procedure
runs in time $O(2^s + r)$. Hence, if $T_s$ is the time
to complete a call to NDR(s,a,b) then there
exists some constant C such that

\[ T_s = T_{s-1} + C(2^s + r). \]

With $n = 2^k + 1$, a call from the top level would
be completed in time $2C(n + r\log(n))$.
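
The bound can be checked numerically by unrolling the recurrence. The sketch below assumes that the recursion bottoms out with no cost below level 0 and takes the logarithm to base 2; both are assumptions made only for this illustration, as is the function name `ndr_time`.

```python
import math


def ndr_time(k, r, C=1.0):
    """Unrolled T_k = sum over s = 0..k of C * (2**s + r)."""
    return sum(C * (2 ** s + r) for s in range(k + 1))


if __name__ == "__main__":
    for k in (1, 4, 8):
        n = 2 ** k + 1
        r = 10
        bound = 2 * (n + r * math.log2(n))
        print(k, ndr_time(k, r), bound)
```

The geometric sum contributes about 2n and the r term is added once per level, which accounts for the n + r log(n) form of the estimate.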
A top level description of the algorithm
would appear as

begin
  call NDR(k,0,0);

  call the limited processor
       Givens reduction
       to reduce the 4n by 4n block
       corresponding to the exterior
       variables Y_i, i=1,..,4;


call a limited processor 
back solve to obtain the 4nr 


variables Te 
call NDBS(k,0,0); 
end. 


Once the subdiagonals associated with the
interior nodes have been eliminated the reduction and
solution of the problem corresponding to the
boundary variables become a straightforward
application of the limited processor techniques
described earlier. The last call is to a nested
dissection back solve procedure which is outlined
as follows.
procedure NDBS(s,a,b)
begin
  define YV_j, XV_i to be the 2^s by r solution
         values associated with PY_j(a,b), PX_i(a,b).


define Fy =A (X, 52) 


=A(X,5Y.)5 


Cis 
for i=l to 2 do 
for j=l to 4 do 
F oaE MEG, 2%) 


j 


od; 
2s BS(U,53F5)3 


F, := F, — MP(B 


1 1 


XV) 


OV px, := MV (XV, )5 


:= BS(U, )3F1)5 


(XV 


) := ; 
2B, VON 5) 


  if s > 0 then pardo
    call NDBS(s-1, a, b);
    call NDBS(s-1, a+2^(s-1), b);
    call NDBS(s-1, a, b+2^(s-1));
    call NDBS(s-1, a+2^(s-1), b+2^(s-1));
  odpar;
end;
This procedure is of the same complexity as the
reduction process and we see the time bound for
the complete solution is of complexity
$O(n + r\log(n))$.
CONCLUSION 
The finite element machine was designed to
solve finite element problems. While the most
obvious application is to adapt iterative
schemes, it has been shown in the preceding
sections that the architecture is rich enough to
support the implementation of a direct solver
that was also originally designed for finite
element problems. The thrust of the preceding
arguments has been to demonstrate that all
interprocessor communication can be accounted for without
changing the complexity of the time bound based
on arithmetic operation counts. Furthermore we
have shown that by pipelining the multiprocessor
one can solve the system for several right hand
sides at little additional cost (C r log(n)). But
concerning the applicability of such a system,
several important points should be raised.

While the algorithm is asymptotically very
fast as a parallel direct solver applicable to a
wide variety of problems, the size of the
constant C may exceed the equivalent factor on a
method like SOR or other iterative schemes by a
factor of 10 or more. On the other hand, these
other methods require more processing to
determine factors such as relaxation parameters in
order to run at their optimal rates. Furthermore,
it is not clear how the iterative schemes can be
effectively pipelined to solve a system for
several right hand sides without simply
multiplying the time bound by r.

A second interesting point concerns the
numerical properties of the algorithm as it is
presented here. The method is a blend of Givens
reductions and direct elimination. While one
would expect that this should work as well as
direct elimination alone, more work is needed to
verify this belief.

A third point of interest lies in the
problem of generating code for a large
multiprocessor. It is important to note that the procedures
described above do not represent the code
resident in any single processor. In fact, no
two processors will execute the same code
sequence. One approach to generating the code
would be to 
generic subroutines, such as the basic routines 
for the Givens reduction described earlier. 
SOLVING LINEAR ALGEBRAIC EQUATIONS ON A MIMD COMPUTER


R. E. Lord, J. S. Kowalik, and S. P. Kumar 
Department of Computer Science 


Washington State University 
Pullman, WA 99164 


Abstract -- Two practical parallel algorithms for solving systems of dense linear equations are presented. They are based on Gaussian elimination and Givens transformations. The algorithms are numerically stable and have been tested on a MIMD computer.


Introduction 


The problem of solving a set of linear alge- 
braic equations is one of the central problems in 
computational mathematics and computer science. 


Excellent numerical methods solving this problem 
on uniprocessor systems have been developed, and 


many reliable and high quality codes are available 


for different cases of linear systems. On the 
other hand, the methods for solving linear equa- 
tions on parallel computers are still in the 
conceptual stage, although some basic ideas have 
already emerged. The current state of the art in 
parallel numerical linear algebra is well de- 
scribed by Heller [3] and Sameh and Kuck [5]. 


Our investigation of methods for solving systems of dense linear equations on a MIMD computer includes Gaussian elimination with partial pivoting and Givens transformations. The first algorithm is commonly used to solve square systems of equations; the second produces an orthogonal decomposition used in several problems of numerical analysis, including linear least squares problems. We focus our attention on the cases where the number of available processors is between 2 and O(n), n being the number of linear equations. We take the view that it is not presently realistic to assume that $O(n^2)$ processors will soon be available to solve sizable sets of equations. To verify our analytic results we have used a parallel computer manufactured by Denelcor Co. [6]. This computer, called HEP (Heterogeneous Element Processor), is a MIMD machine of the shared resource type as defined by Flynn.


Gaussian Elimination 


If we consider a step to be either a multiplication and a subtraction, or a compare and multiplication, then a sequential program for producing the LU decomposition of an n x n non-singular matrix requires $T_1 = n^3/3 + O(n^2)$ steps. The parallel method using $p = (n-1)^2$ processors and partial pivoting requires $T_p = O(n \log n)$ steps. Thus the efficiency of such a method for large n will be

\[ E_p = \frac{T_1}{p\,T_p} = \frac{1}{O(\log n)}. \]


CH1569-3/80/0000-0205$00.75 © 1980 IEEE 


Even if the cost of each processor in a parallel system is substantially less than current processor costs, this method will be economically unfeasible for n sufficiently large. We further observe that parallel computers which are or soon will be available will not provide $n^2$ processors for reasonable values of n. Thus, we restrict our attention to the case where the number of processors is in the range from 2 to O(n).

The algorithm which we present provides the LU decomposition of an n x n non-singular matrix A using from 1 to $\lceil n/2 \rceil$ processors and has an efficiency of 2/3 when $p = \lceil n/2 \rceil$.


Consider the sequential program for LU decomposition with partial pivoting given in Fig. 1.

    Program LUDECOMP (A(n,n))
    For k ← 1 to n-1 do
        Find ℓ such that |A(ℓ,k)| = max(|A(k,k)|, ..., |A(n,k)|)
        PIV(k) ← ℓ    {pivot row}
        A(PIV(k),k) ↔ A(k,k)
        c ← 1/A(k,k)
        For i ← k+1 to n do
            A(i,k) ← A(i,k)*c                      {end of task T_k^k}
        For j ← k+1 to n do
            A(PIV(k),j) ↔ A(k,j)
            For i ← k+1 to n do
                A(i,j) ← A(i,j) - A(i,k)*A(k,j)    {task T_k^j}

Fig. 1: Program for LU decomposition with illustration of tasks.
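As an illustration, the program of Fig. 1 can be sketched in Python. This is only a sketch under stated assumptions: indices are 0-based, the body of the column-scaling loop (multiplying the subdiagonal of column k by c) is inferred from the computation of c, and the helper `lu_solve` is a hypothetical addition used only to check the factors.

```python
from fractions import Fraction

def lu_decomp(a):
    """In-place LU decomposition with partial pivoting, following Fig. 1.

    `a` is a list of n rows; returns the list of pivot rows (0-based).
    """
    n = len(a)
    piv = []
    for k in range(n - 1):
        # Task T_k^k: pivot search, swap in column k, scale column k.
        l = max(range(k, n), key=lambda i: abs(a[i][k]))
        piv.append(l)
        a[l][k], a[k][k] = a[k][k], a[l][k]
        c = 1 / a[k][k]
        for i in range(k + 1, n):
            a[i][k] *= c  # assumed loop body: store the multipliers
        # Tasks T_k^j (j > k): each touches only column j and reads column k,
        # so for a fixed k they do not interfere with one another.
        for j in range(k + 1, n):
            a[l][j], a[k][j] = a[k][j], a[l][j]
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return piv

def lu_solve(a, piv, b):
    """Hypothetical helper: solve A x = b from the factors left in `a`."""
    n = len(a)
    x = list(b)
    for k in range(n - 1):
        # Treat b as one more column: same swap and same elimination step.
        x[piv[k]], x[k] = x[k], x[piv[k]]
        for i in range(k + 1, n):
            x[i] -= a[i][k] * x[k]
    for i in range(n - 1, -1, -1):
        s = x[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x
```

Exact rational entries (`Fraction`) are used in the check so that the result can be compared without rounding tolerances; floating-point entries work the same way.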


In this program we shall consider a task to be that code segment which works on a particular column j for a particular value of k. We will denote these tasks by $J = \{T_k^j \mid 1 \le k \le j \le n,\ k \le n-1\}$.

The precedence constraints imposed by the sequential program are

\[ <^* \;=\; \{(T_k^j, T_m^\ell) \mid j < \ell \text{ and } k = m, \text{ or } k < m\}. \]

Thus, $C = (J, <^*)$ is the task system which represents the sequential program (Coffman, Denning [1]). The range and domain of these tasks are:

\[ R(T_k^j) = \{A(i,j) \mid k \le i \le n\} \]
\[ D(T_k^j) = \{A(i,j) \mid k \le i \le n\} \cup \{A(i,k) \mid k \le i \le n\} \]

and from this we can observe that, for example, $\{T_1^2, \ldots, T_1^n\}$ are all mutually noninterfering tasks and could be executed in parallel. More specifically we observe that $C' = (J, <^{*\prime})$, where $<^{*\prime}$ is the transitive closure on the relation

\[ X = \{(T_k^k, T_k^j) \mid k < j \le n\} \cup \{(T_k^j, T_{k+1}^j) \mid k < j \le n\}, \]

is a maximally parallel system equivalent to C. This system is illustrated in Fig. 2.

Fig. 2: Maximally Parallel Task System Equivalent to C.
Given the task system C' we now determine the execution time of the tasks and from that determine a schedule. We assume that one multiply and one subtract, or one multiply and one compare, constitute a time step. Thus, neglecting any overhead for loop control, the execution time $w(T_k^j)$ for each of the tasks is given by:

\[ w(T_k^j) = \begin{cases} n+1-k & \text{if } k = j \\ n-k & \text{if } k < j \end{cases} \]

Treating the task system C' together with $w(T_k^j)$ as a weighted graph we observe that the longest path traverses the nodes $T_1^1, T_1^2, T_2^2, T_2^3, \ldots, T_{n-1}^{n-1}, T_{n-1}^n$. We will denote this path as $s_1$, and the length of the path by $L(s_1)$:

\[ L(s_1) = n + 1 + 2\sum_{j=2}^{n-1} j = n^2 - 1. \]
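The path length can be checked numerically. The sketch below assumes task weights of n+1-k steps for a pivot task $T_k^k$ and n-k steps for a column task $T_k^j$ with j > k, and assumes the precedence edges $T_k^k \to T_k^j$ and $T_k^j \to T_{k+1}^j$; the function names are illustrative only.

```python
def weight(n, k, j):
    """Assumed execution time of task T_k^j in steps (1-based k and j)."""
    return n + 1 - k if j == k else n - k

def longest_path(n):
    """Length of the longest weighted path in the assumed task graph."""
    dist = {}
    for k in range(1, n):              # k = 1 .. n-1
        for j in range(k, n + 1):      # j = k .. n
            preds = []
            if j > k:
                preds.append((k, k))       # T_k^k precedes T_k^j
            if k > 1:
                preds.append((k - 1, j))   # T_{k-1}^j precedes T_k^j
            best = max((dist[p] for p in preds), default=0)
            dist[(k, j)] = best + weight(n, k, j)
    return max(dist.values())
```

Under these assumptions the longest path equals $n^2 - 1$ for every n tried, in agreement with $L(s_1)$.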

Since the problem cannot be solved in time shorter than this path length we developed a schedule where the tasks constituting $s_1$ are assigned to processor 1 and the remaining tasks are assigned to $\lceil n/2 \rceil - 1$ additional processors. Processor 2 will execute the tasks [...] and, more generally, processor j will execute the tasks [...], and we will denote this as $s_j$. Note that this is not a path through the graph. Since this schedule has length $n^2-1$, the length of the longest path, this schedule is optimal for n/2 processors. Using this schedule we note that:

\[ \lim_{n\to\infty} \frac{S_p}{p} = \lim_{n\to\infty} \frac{n^3/3 + O(n^2)}{(n^2-1)\,n/2} = \frac{2}{3} \]

and this efficiency is achieved to within 2% for relatively small n ($n \ge 50$).


We now examine the question as to whether a schedule of length $n^2-1$ is achievable with p < n/2 processors. From the task system C' as illustrated in Fig. 2 we note that task $T_1^1$ is a predecessor to all tasks and has an execution time of n steps. Consequently, any schedule for this system will have only one processor doing work during the first n steps. Similarly, $T_{n-1}^n$ is the successor of all tasks and thus during the last time step only one processor can be doing work. Task $T_{n-2}^{n-1}$ has all tasks except $\{T_i^n \mid 1 \le i \le n-1\}$ as predecessors, task $T_{n-1}^{n-1}$ is a successor task and for the tasks $\{T_j^n \mid 1 \le j \le n-1\}$ each is a successor or predecessor of all other tasks in the set. Thus, for any schedule from the time that $T_{n-2}^{n-1}$ commences execution, no more than 2 processors can be doing work. By similar argument, once $T_{n-j}^{n-j+1}$ commences execution no more than j processors can be doing work. From this, we define the "computational area" of any schedule C to be the product of the number of processors and the schedule length less the area where not all the processors can be doing work. Since $w(T_{n-j}^{n-j+1}) = j$ and during each time interval of length 2j at most j processors are working, we have:

\[ CA = (n^2-1)p - (p-1)n - 2\sum_{j=2}^{p-1}(p-j)\,j - (p-1) = (n^2-1)p - (p-1)(n-1) - (p^3-p)/3. \]
The total amount of work (sum of the task weights) for the task system C is $TW = (n^3/3) + (2n/3) - 1$. Thus, a lower bound on the number of processors required to achieve a schedule of length $n^2-1$ is the smallest p for which $CA \ge TW$. For small even values of n the minimum p values are:

     2 ≤ n ≤  8     p = (n/2)
    10 ≤ n ≤ 14     p = (n/2)-1
    16 ≤ n ≤ 22     p = (n/2)-2
    24 ≤ n ≤ 28     p = (n/2)-3
    30 ≤ n ≤ 34     p = (n/2)-4
    36 ≤ n          p ≤ (n/2)-5
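The table can be reproduced from the two closed forms. The sketch below assumes $CA = (n^2-1)p - (p-1)(n-1) - (p^3-p)/3$ and $TW = n^3/3 + 2n/3 - 1$, and compares 3·CA with 3·TW so that only integers are involved; the function names are illustrative.

```python
def total_work_x3(n):
    """3 * TW, with TW = n^3/3 + 2n/3 - 1."""
    return n**3 + 2 * n - 3

def comp_area_x3(n, p):
    """3 * CA, with CA = (n^2-1)p - (p-1)(n-1) - (p^3-p)/3."""
    return 3 * (n * n - 1) * p - 3 * (p - 1) * (n - 1) - (p**3 - p)

def min_processors(n):
    """Smallest p whose computational area covers the total work."""
    p = 1
    while comp_area_x3(n, p) < total_work_x3(n):
        p += 1
    return p

if __name__ == "__main__":
    for n in range(2, 38, 2):
        print(n, min_processors(n), n // 2 - min_processors(n))
```

Running the loop prints, for each even n, the minimum p and its offset below n/2, which can be compared line by line with the table.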


For large values of n let p = αn and determine α such that

\[ \lim_{n\to\infty} (CA/TW) = 1. \]

Thus, an α to satisfy the above limit is a solution to $3\alpha - \alpha^3 = 1$ and an approximate solution to this is α = 0.34729. We note that this is only a lower bound and we do not know if it is achievable in general; however, for n=10 we have found a schedule of length $n^2-1$ using (n/2)-1 processors and for n=16 a schedule using (n/2)-2 processors.

Should this lower bound be achievable then the efficiency for large n and using αn processors would be

\[ \lim_{n\to\infty} (S_p/p) = \frac{n^3/3}{(\alpha n)\,n^2} = 0.9598. \]
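The two constants can be recovered with a few lines of bisection. This sketch assumes the limiting equation is $3\alpha - \alpha^3 = 1$ (obtained from CA/TW with p = αn) and that the limiting efficiency is $1/(3\alpha)$; `solve_alpha` is an illustrative name.

```python
def solve_alpha(lo=0.0, hi=1.0, tol=1e-12):
    """Bisection for the root of 3a - a^3 - 1 on [0, 1].

    The function is increasing there (derivative 3 - 3a^2 >= 0),
    negative at 0 and positive at 1, so the root is unique.
    """
    f = lambda a: 3 * a - a**3 - 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = solve_alpha()
efficiency = 1 / (3 * alpha)  # (n^3/3) / ((alpha n) n^2)
```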


Actual Performance


The achievable schedules previously discussed 
were programmed using HEP FORTRAN and were exe- 
cuted on the HEP parallel computer. Although the 
program provided solutions to a set of linear 
equations, we present timing only for the LU de- 
composition part of the solution so that it may 
be compared with our predicted results. Due to memory limitations of the machine to which we had access, we could only run programs with $n \le 35$ and $1 \le p \le 8$. Table 1 gives the achieved results together with a comparison of the predicted results.


Although the actual results are limited by 
the restriction on the maximum value for n, we 
feel that the agreement between actual and pre- 
dicted performance is sufficiently good to give 
credibility to our model of the algorithm's per- 
formance and that the efficiencies are high 
enough to support the conclusion that parallel 
methods for solving linear equations are a viable 
alternative to sequential methods. 


Fast Givens Transformations

To solve the square system of equations Ax=b using the fast Givens transformations, due to Gentleman [2], we proceed as follows:

(i) The matrix A is kept in the factorized form $A = D^{1/2}B$, where D is a diagonal matrix. Initially $D = I_n$, $B = A$, where n is the number of equations.

(ii) Triangularize the matrix A by applying Givens rotations to the augmented matrix [A,b] and obtain the factors Q, D, R and $\tilde b$, such that

\[ Q[A, b] = D^{1/2}[R, \tilde b], \]

where R is upper triangular, Q is the product of the orthogonal transformations used in the triangularization, and D is diagonal.

(iii) Solve the upper triangular system $Rx = \tilde b$ by back substitution.


The sequential method of orthogonal triangularization of A eliminates the subdiagonal nonzero elements of A one at a time. The elimination process is performed sequentially by applying Givens plane rotations to A in such a way that the previously introduced zeros are not destroyed. For each column j of A, n-j rotations are required. This can be accomplished by algorithm 1:

    for i ← 1 to n-1 do
        for j ← i+1 to n do
            GIVENS (i,j)

which reduces A to $D^{1/2}B = QA$, where $Q = P_{n-1,n} \cdots P_{1,3} P_{1,2}$ is the product of the n(n-1)/2 Givens plane rotations. GIVENS (i,j) is a subroutine which constructs and applies the plane rotation $P_{i,j}$. The matrix $P_{i,j}$ rotates the rows i and j and annihilates the element in the (i,j)-th position. The entire process requires

\[ \frac{4}{3}n^3 + \frac{7}{2}n^2 - \frac{29}{6}n \]

arithmetic operations.


In a parallel implementation of the fast Givens method more than one plane rotation could be applied concurrently. Sameh and Kuck [5], and Kowalik et al. [4] describe details of such schemes which assume that $p = O(n^2)$ processors are available. The algorithm proposed in Kowalik et al. [4] produces the orthogonal matrix $Q = Q_{2n-3} Q_{2n-4} \cdots Q_2 Q_1$, where $Q_k = \{P_{i,j} \mid 1 \le i < j \le n,\ i+j = k+2\}$, $k = 1,2,\ldots,2n-3$, and the $P_{i,j}$ in each $Q_k$ are applied in parallel.


For the purpose of this analysis and imple- 
mentation we assume that the number of available 
processors is p = (n-1)/2 where n is odd. We also 
assume that every Givens rotation is performed 
sequentially, however, more than one rotation 
could be performed in parallel. 


We derive now a parallel scheme to triangularize A from the sequential method given in algorithm 1.

Let a task $T_j^i$ in algorithm 1 be defined by $T_j^i$ = GIVENS (i,j), where GIVENS(i,j) performs the following calculations:

    1. α = -B(j,i)/B(i,i)
    2. β = -(D(j)/D(i))*α
    3. γ = 1 - αβ
    4. D(i) = (1/γ)D(i)
    5. D(j) = (1/γ)D(j)
    6. B(i,ℓ) = B(i,ℓ) + βB(j,ℓ)
    7. B(j,ℓ) = B(j,ℓ) + αB(i,ℓ)

with steps 6 and 7 carried out for ℓ = i, ..., n.
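The seven steps above can be sketched in Python as follows. This is a minimal sketch, not the authors' HEP FORTRAN routine: it assumes 0-based indices, assumes every pivot B(i,i) is nonzero, omits the rescaling and row interchanges, and assumes that steps 6 and 7 both use the values of rows i and j from before the update. The helper `upper_factor` is a hypothetical addition that forms $D^{1/2}B$ for checking.

```python
import math

def givens(B, D, i, j):
    """Square-root-free update GIVENS(i, j): annihilate B[j][i] using row i."""
    n = len(B)
    alpha = -B[j][i] / B[i][i]          # step 1 (assumes B[i][i] != 0)
    beta = -(D[j] / D[i]) * alpha       # step 2
    gamma = 1.0 - alpha * beta          # step 3
    D[i] = D[i] / gamma                 # step 4
    D[j] = D[j] / gamma                 # step 5
    for l in range(i, n):               # steps 6 and 7, old row values
        bi, bj = B[i][l], B[j][l]
        B[i][l] = bi + beta * bj
        B[j][l] = bj + alpha * bi

def triangularize(A):
    """Algorithm 1: returns (D, B) with D^(1/2) B upper triangular."""
    n = len(A)
    B = [row[:] for row in A]
    D = [1.0] * n
    for i in range(n - 1):
        for j in range(i + 1, n):
            givens(B, D, i, j)
    return D, B

def upper_factor(D, B):
    """Hypothetical helper: the triangular matrix D^(1/2) B."""
    n = len(B)
    return [[math.sqrt(D[i]) * B[i][l] for l in range(n)] for i in range(n)]
```

Since the accumulated transformation is orthogonal, the product of the transpose of $D^{1/2}B$ with itself should agree with $A^T A$ up to rounding, which gives a simple correctness check.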


Periodic rescaling of D and B to prevent underflows and overflows, and row interchanges for numerical stability are included in our implementation of the Givens routine. The precedence constraints on the set of these tasks

\[ J = \{T_j^i \mid 1 \le i \le n-1,\ i < j \le n\} \]

imposed by algorithm 1 are given by

\[ <^* \;=\; \left[\{(T_j^i, T_{j+1}^i) \mid 1 \le i \le n-2,\ i < j \le n-1\} \cup \{(T_n^i, T_{i+2}^{i+1}) \mid 1 \le i \le n-2\}\right]^* \]

where * represents the transitive closure of the set. Thus the system $C = (J, <^*)$ is the task system representing the sequential program. The range and domain of these tasks are:

\[ R(T_j^i) = \{D(i), D(j), B(i,\ell), B(j,\ell) \mid i \le \ell \le n\} \]
\[ D(T_j^i) = \{D(i), D(j), B(i,\ell), B(j,\ell) \mid i \le \ell \le n\}. \]

From this we can see that, for each $k = 1,2,\ldots,2n-3$, the tasks $\{T_j^i \mid 1 \le i \le n-1,\ i+j = k+2\}$ are mutually noninterfering tasks and can be executed in parallel. Hence we obtain a maximally parallel task system $C' = (J, <^{*\prime})$, where

\[ <^{*\prime} \;=\; \left[(T_j^i, T_{j+1}^i) \cup (T_j^i, T_j^{i+1}) \mid 1 \le i \le n-2,\ i < j \le n-1\right]^* \]

is equivalent to C.


This maximally parallel task system C' is shown in Fig. 3. We now assume that one arithmetic operation constitutes a time step. Thus the length of $T_j^i$ is $L(T_j^i) = 4(n-i+1) + 7$ steps. The longest path in this maximally parallel task system is $s_1 = T_2^1, T_3^1, \ldots, T_n^1, T_n^2, \ldots, T_n^{n-1}$ and the total length of $s_1$ is

\[ L(s_1) = (4n+7)(n-1) + (4(n-1)+7) + (4(n-2)+7) + \cdots + (4\cdot 2+7) = 6n^2 + 8n - 25 \text{ operations.} \]
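Both operation counts follow from the task length 4(n-i+1)+7 and can be checked directly. The sketch below assumes that the longest path consists of the n-1 tasks with i = 1 followed by one task for each i = 2, ..., n-1, and that the sequential count is the sum of all task lengths; the function names are illustrative.

```python
def task_length(n, i):
    """Steps in GIVENS(i, j): 7 scalar operations plus 4 per updated column."""
    return 4 * (n - i + 1) + 7

def sequential_ops(n):
    """Total operations of algorithm 1 (sum over all n(n-1)/2 tasks)."""
    return sum(task_length(n, i)
               for i in range(1, n) for j in range(i + 1, n + 1))

def critical_path(n):
    """Assumed longest path: all tasks with i = 1, then one per later i."""
    return (n - 1) * task_length(n, 1) + sum(task_length(n, i)
                                             for i in range(2, n))
```

With these definitions the path length matches $6n^2 + 8n - 25$, and six times the sequential count matches $8n^3 + 21n^2 - 29n$, i.e. the count $\frac{4}{3}n^3 + \frac{7}{2}n^2 - \frac{29}{6}n$.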


Fig. 3: Maximally Parallel Task System C'. 


To execute our task system with p = (n-1)/2 processors we have selected a scheduling scheme called ZIGZAG, shown in Fig. 4. According to this scheme the processors $P_k$, k = 1,2,...,(n-1)/2, are assigned to the tasks as follows:

    P_1 executes: {...}
    P_2 executes: {...}
    P_j executes: {...}
    P_{(n-1)/2} executes: {...}

For this schedule the speedup and efficiency are:

\[ S_p = \frac{T_1}{T_p} = \frac{\frac{4}{3}n^3 + O(n^2)}{6n^2 + O(n)} \approx \frac{2n}{9}, \qquad E_p = \frac{S_p}{p} \approx \frac{2n}{9}\cdot\frac{2}{n-1} = \frac{4}{9}\cdot\frac{n}{n-1}, \]

and for sufficiently large values of n, $E_p \approx 0.444\ldots$.


Computational Results 


The ZIGZAG scheme for orthogonal triangularization shown in Fig. 4 was programmed and executed on the HEP parallel computer. Due to the present memory limitations the program was run for the values of n not exceeding n = 29. Since for this machine $1 \le p \le 8$, and we assumed that p = (n-1)/2, the obtained numerical results up to n = 17 are useful to compare. The actual and predicted speedups and efficiencies of the algorithm for different values of n are shown in Table 2. The differences between the predicted and actual values of $S_p$ and $E_p$ are due to several factors: machine overhead, approximate count of arithmetic operations involved in Givens rotations, and data dependent number of scaling operations in the GIVENS routine which are not included in the operations count.

Fig. 4: Parallel Zigzag Scheme for n = ...
Actual and Predicted Speedup and Efficiency. Time is measured in seconds.
OPTIMAL INTEGRATED-CIRCUIT IMPLEMENTATION
OF TRIANGULAR MATRIX INVERSION


Franco P. Preparata 


Coordinated Science Laboratory 
University of Illinois 

URBANA, Illinois 61801 

U.S.A. 


Abstract 


We describe a class of integrated-circuit implementations of algorithms for inverting an n x n triangular matrix. These networks have area A and time T, with an area x time² product $AT^2 = O(n^4)$ for all values of T such that $O(\log^2 n) \le T \le O(n)$. Since there is a simple reduction of matrix multiplication to inversion of a triangular matrix, and Savage [6] has given an $AT^2 = \Omega(n^4)$ lower bound for n x n matrix multiplication, the presented networks are asymptotically optimal in the VLSI model.


Keywords: VLSI, matrix inversion, triangular matrices, area-time complexity, pipeline computation, optimal networks.


1. Introduction 


Increasing attention has been paid recently to the design of networks for the direct implementation of several interesting algorithms using the integrated-circuit technology (VLSI); particularly, combinatorial and numerical problems have been the target of these investigations [1-4].

Among numerical problems, several workers have directed their attention to matrix computations [1,2,5], and, as regards the design of networks, have found that the mesh interconnection of computing modules is particularly attuned to this class of problems, leading to optimal realizations [5,6] in the VLSI model [7,8].

In this paper we consider the problem of designing VLSI networks for inverting a nonsingular triangular matrix. The design complies with specifications of the VLSI model of computation recently proposed by Mead, Conway, and Thompson [7,8], and further refined by Brent and Kung [3].


This work was partially supported by National Science Foundation Grant MCS-78-13642, and by the Joint Services Electronics Program Contract N00014-79-C-0424, and by ERA 452 "Al Khowarizmi", Centre National de la Recherche Scientifique, France.
In this model, the network is a computation graph consisting of nodes (processing modules) and wires. Wires have unit width and are partitionable into two orthogonal sheaves. A data item takes a unit of time to propagate along a wire from node to node (processing time is thus absorbed into propagation time). A mathematically natural complexity metric is the area x time² product ($AT^2$), which embodies a trade-off between production cost (chip area A) and incremental cost (time T).


Within this model, Savage [6] has recently proved the following interesting result: any VLSI design for the multiplication of two n x n matrices, with chip area A and computation time T, must satisfy the bound $AT^2 \ge C n^4$, for some constant C. In [5] the authors demonstrate the existence of VLSI networks for multiplying n x n matrices with $AT^2 = O(n^4)$ for any computation time T in the range $\log n \le T \le n$. Note that an $AT^2 \ge C' n^4$ bound also holds for the problem of inverting a nonsingular n x n triangular matrix, since matrix multiplication is reducible to it; the straightforward reduction is based on the fact that the inverse of the 3n x 3n triangular matrix

\[ \begin{pmatrix} I & A & 0 \\ 0 & I & B \\ 0 & 0 & I \end{pmatrix} \quad\text{is}\quad \begin{pmatrix} I & -A & AB \\ 0 & I & -B \\ 0 & 0 & I \end{pmatrix}, \]

i.e., it contains an n x n block equal to the product AB.
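The reduction is easy to confirm on a small instance. The following sketch builds both 3n x 3n matrices from two n x n blocks and multiplies them; `matmul`, `assemble` and `reduction_matrices` are illustrative helper names, not part of the paper.

```python
def matmul(X, Y):
    """Plain triple-loop matrix product on lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def assemble(blocks):
    """Flatten a 3 x 3 grid of n x n blocks into one 3n x 3n matrix."""
    n = len(blocks[0][0])
    return [sum((blocks[bi][bj][r] for bj in range(3)), [])
            for bi in range(3) for r in range(n)]

def reduction_matrices(A, B):
    """Return the block triangular matrix and its claimed inverse."""
    n = len(A)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    Z = [[0] * n for _ in range(n)]
    neg = lambda M: [[-v for v in row] for row in M]
    M = assemble([[I, A, Z], [Z, I, B], [Z, Z, I]])
    M_inv = assemble([[I, neg(A), matmul(A, B)], [Z, I, neg(B)], [Z, Z, I]])
    return M, M_inv
```

Multiplying the two matrices gives the identity, and the upper right n x n block of the inverse is the product AB, so any triangular inverter also multiplies matrices.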


This paper is organized as follows: In Section 2 we present a general scheme for inverting an n x n triangular matrix, and evaluate two network implementations, corresponding respectively to block-partitioning the matrix and choosing extreme values for the block size in the allowable range. These two inverters are referred to as "recursive" and "systolic" respectively; with respect to the $AT^2$ measure, only the latter is optimal for T = O(n). In Section 3 we show that the recursive and systolic inverters can be combined to build networks, called "mixed" inverters, which meet the optimal $AT^2 = \Omega(n^4)$ bound for all values of T such that $O(\log^2 n) \le T \le O(n)$.




triangular matrix. 


Let A be a nonsingular n x n triangular matrix (1) to be thought of as an n/s x n/s matrix whose elements are s x s blocks of the original entries (s is a parameter in the range [1, n/2]): let $A_{ij}$ be the (i,j) block of A (i,j = 1,2,...,n/s) and let $A_{ij}^{(-1)}$ be the corresponding block of $A^{-1}$. It is straightforward to verify that

\[ A_{ii}^{(-1)} = (A_{ii})^{-1}, \qquad A_{ij}^{(-1)} = -\Big(\sum_{i \le k < j} A_{ik}^{(-1)} A_{kj}\Big) A_{jj}^{(-1)} \quad \text{for } i < j. \tag{1} \]

This general formula will now be specialized to two 
interesting cases. 


2.1 Recursive inversion 


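The standard scheme for the parallel inversion of a triangular matrix [9,10] corresponds to specializing the general scheme to s=n/2. In this case the inverse of

\[ \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix} \quad\text{is}\quad \begin{pmatrix} A_{11}^{-1} & -A_{11}^{-1} A_{12} A_{22}^{-1} \\ 0 & A_{22}^{-1} \end{pmatrix}. \tag{2} \]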

This immediately suggests a recursively defined network, containing two inverters of n/2 x n/2 triangular matrices (to be used to compute $A_{11}^{-1}$ and $A_{22}^{-1}$ in parallel) and a network for the parallel multiplication of two n/2 x n/2 matrices (to be used to compute $(A_{11}^{-1} A_{12}) A_{22}^{-1}$ in the order shown by the parenthesization). In figure 1, we show a possible layout for such a network. Each line shown carries $n^2/4$ operands in parallel and the shaded surfaces are buffers of area $(n^2/4)$; the core of the circuit consists of two multipliers of two (n/2) x (n/2) matrices, of a type described in [5], and called recursive multipliers. Each of these multipliers has height and width respectively proportional to $n^2$ and computes a matrix product in O(log n) time units. Due to the recursive definition of the inverter, a simple argument shows (2) that its height and width are also respectively proportional to $n^2$; also, the computation time is $O((\log n)^2)$. Note therefore that for the matrix inverter being described, called recursive inverter, we have the following properties:

    (3)   Recursive inverter:   $A = O(n^4)$,   $T = O(\log^2 n)$,   $AT^2 = O(n^4 \log^4 n)$

Note that $AT^2$ is short of the optimal $\Omega(n^4)$ by a small order factor $O(\log^4 n)$.

(1) The entries of all matrices considered in this paper are assumed to be drawn from a finite ring, so that an elementary finite chip can be used for multiplying and adding entries in constant area and time.

Figure 1. Layout of the recursive matrix inverter; shaded boxes are data buffers.


2.2. Systolic inversion 


The next scheme to be described corresponds to the choice s=1 in the general method. The resulting network is a mesh of processors, each of which feeds data in and out, each time performing some computation, keeping a regular flow in the network. Such networks have been called systolic by Kung and Leiserson [1].

(2) Recurrences defining height and width are of the form $f(n) \le f(n/2) + An^2$ for some constant A; the solution satisfies $f(n) \le \frac{4}{3} An^2$.

With our choice of s, block $A_{ik}^{(-1)}$ in (1) becomes entry $a_{ik}^{(-1)}$ (and similarly $A_{kj}$ becomes $a_{kj}$).
The form of (1) suggests a computation method on an n x n square mesh (figure 2). Only the upper-triangular positions in this mesh need contain processing modules (i.e., denoting by $M_{ij}$ the module in position (i,j), $M_{ij}$ is deployed only for $j \ge i$). Modules are of two types with different computational capabilities: D-modules and M-modules, placed respectively in diagonal and off-diagonal positions.


Figure 2. General structure of the systolic matrix inverter (triangular mesh).

Each module contains an operand register R, and input/output ports referred to by means of the compass points N, E, S, and W; the instructions executable in either type of module are shown compactly in figure 3.

    D-modules:
        R ← 1/R;  E ← N ← R
    M-modules:
        initialisation step:  N ← R;  R ← W·R;  E ← W
        general step:         R ← R + W·S;  E ← W;  N ← S
        final step:           R ← -R·S;  E ← R;  N ← S

Figure 3. Input/output structures and instruction sets of D-modules and M-modules.


Initially, each entry $a_{ij}$ ($i \le j$) is read into register R of module $M_{ij}$.

The first module to be activated is $M_{11}$, which computes $a_{11}^{(-1)} = 1/a_{11}$ in R, sends the result to $M_{12}$ and activates $M_{22}$. All D-modules perform the same function upon activation: invert the $a_{ii}$ entry, broadcast it eastward to $M_{i,i+1}$ and activate $M_{i+1,i+1}$.

As for off-diagonal modules, they accumulate the inner product $\sum_{i \le k < j} a_{ik}^{(-1)} a_{kj}$ in their general step, and transform it to $a_{ij}^{(-1)} = -\big(\sum_{i \le k < j} a_{ik}^{(-1)} a_{kj}\big) \times a_{jj}^{(-1)}$ in their final step.

For the purpose of the general step, entries $a_{ik}^{(-1)}$ are transmitted eastward along horizontal lines, while entries $a_{kj}$ are transmitted northward along vertical lines. A timing argument shows later that $a_{ik}^{(-1)}$ and $a_{kj}$ meet in $M_{ij}$ for k = i,...,j-1. Module $M_{ij}$ is thus activated when it receives $a_{ii}^{-1}$ on its west entry port and it proceeds with its initialization step: sending $a_{ij}$ northward, passing $a_{ii}^{-1}$ eastward, accumulating $a_{ii}^{-1} a_{ij}$ in its register R, and entering its general step. During the general step, $M_{ij}$ receives $a_{ik}^{-1}$ and $a_{kj}$ on its W- and S-entry ports; it dutifully passes them on E- and N-ports, accumulating $R \leftarrow R + a_{ik}^{-1} \cdot a_{kj}$. It enters the final step when it receives $a_{jj}^{-1}$ on its S-entry port; next the S input is passed northward, the result $a_{ij}^{-1}$ kept in R and also transmitted eastward.


To ensure that timing is correct, we can verify that:

- Module $M_{jj}$ is activated at time j;
- M-modules $M_{ij}$ are in their general step from time j+1 to 2j-i-1; their final step occurs at time 2j-i;
- entries $a_{ip}^{(-1)}$ and $a_{pj}$ reside in $M_{ij}$ at the (p-i+j)-th step;
- entry $a_{jj}^{(-1)}$ arrives from S in $M_{ij}$ at time 2j-i.
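These timing claims can be exercised with a small sequential model of the mesh. The sketch below is not a cycle-accurate simulation: it assumes that module $M_{ij}$ finishes at step 2j-i, evaluates the entries of the inverse in that order, and asserts that every operand has been produced no later than the step p-i+j at which it is supposed to reach $M_{ij}$. The function name `systolic_inverse` is illustrative.

```python
from fractions import Fraction

def systolic_inverse(a):
    """Invert an upper triangular matrix in the order of the mesh schedule.

    Returns the inverse and the number of steps of the assumed schedule.
    """
    n = len(a)
    inv = [[Fraction(0)] * n for _ in range(n)]
    done = {}  # (i, j), 1-based -> step at which M_ij completes
    order = sorted((2 * j - i, i, j)
                   for i in range(1, n + 1) for j in range(i, n + 1))
    for t, i, j in order:
        if i == j:
            # D-module: invert the diagonal entry at time j.
            inv[i - 1][j - 1] = Fraction(1) / a[i - 1][j - 1]
        else:
            acc = Fraction(0)
            for p in range(i, j):
                # a_ip^(-1) must be complete before it reaches M_ij.
                assert done[(i, p)] <= p - i + j
                acc += inv[i - 1][p - 1] * a[p - 1][j - 1]
            # a_jj^(-1) travels j - i cells north before the final step.
            assert done[(j, j)] + (j - i) == t
            inv[i - 1][j - 1] = -acc * inv[j - 1][j - 1]
        done[(i, j)] = t
    return inv, max(done.values())
```

For an n x n matrix the last module to finish is $M_{1n}$, at step 2n-1.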


For clarity, in figure 4(a) we illustrate the timing of the computations: each module is labelled with an integer which denotes the step at which computation in that module is completed. Also, in figure 4(b and c) we present snapshots of the data participating in the horizontal and vertical flow, respectively, at step 7. Clearly the calculation of $A^{-1}$ is completed in 2n-1 steps.

Figure 4(a): timing of completion of computation up to step 7.
(b): data (x) participating in horizontal flow at step 7.
(c): data (x) participating in vertical flow at step 7.


According to our original assumption that both the area of the processing modules and the time needed to execute any of the prescribed operations be bounded by a constant, we have the following:

    Systolic inverter:   $A = O(n^2)$,   $T = O(n)$,   $AT^2 = O(n^4)$

i.e., the network is optimal for the $AT^2$ measure. The optimal behavior, however, is achieved only for T = O(n). An interesting question is whether it can be extended to a wider range of processing times. This question is addressed in the next section.


3. Mixed networks 


We now describe how to combine the recursive and systolic inverters described in the preceding section in order to improve the $AT^2$ measure for a wide range of the time parameter T.

The resulting networks, to be called mixed, have the following general structure. A mixed network is a systolic scheme, as the one described in 2.2, where the "operands", rather than being elementary entries, are blocks of s x s such entries. In the corresponding n/s x n/s triangular mesh (see figure 2), the modules must now be designed to process s x s blocks. The layout of mixed networks is chosen as in figure 5, where the modules themselves have been conveniently assumed to have a rectangular shape on the chip (else, we consider the smallest rectangle with sides parallel to the coordinate axes which contains the module). From figure 5 it is clear that while the dimensions (width and height) of the M-module determine one dimension of the network (say, its width), the other dimension (say, its height) is determined by the larger of the corresponding values for the D- and M-modules.
Figure 5: General layout of mixed networks.

We can design a mixed inverter as follows (called Type-1 mixed inverter):

(1) D-modules are recursive inverters, as described in Section 2.1; they have width and height proportional to $s^2$, and computation time $O(\log^2 s)$.

(2) M-modules are recursive matrix multipliers of the type shown in [5], as already used to build the recursive inverter. They can be placed on the chip so that their width and height are both $O(s^2)$. Their computation time is O(log s). The following point must be noted: although this type of multiplier is completely pipelinable, i.e., it can complete an s x s matrix product at each step, we cannot take advantage of the property since the term $A_{ij}^{(-1)}$ is available on the E-port of $M_{ij}$ only O(log s) time units after the eastward transmission of $A_{i,j-1}^{(-1)}$ (indeed

\[ A_{ij}^{(-1)} = -\Big(\sum_{i \le k < j} A_{ik}^{(-1)} A_{kj}\Big) A_{jj}^{(-1)}, \]

so the multiplication $A_{i,j-1}^{(-1)} A_{j-1,j}$ must be completed before the final step of module $M_{ij}$ may begin).

Since the heights of the D- and M-modules are of the same order, the height of the network is $(n/s) \cdot O(s^2) = O(ns)$, and the same holds for the width of the network. Thus, $A = O(n^2 s^2)$ and the smallest containing rectangle is nearly a square with both sides O(ns). As regards computation time, the blocks $A_{ii}^{(-1)}$ (i = 1,...,n/s) are all computed in time $O(\log^2 s)$, and, after this, the mesh computation begins. We have shown in Section 2.2 that the systolic network completes its computation in O(n/s) steps, whence the total computation time is $T = O(\log^2 s + \frac{n}{s}\log s)$. If we bound the parameter s by $s \le n/\log n$ we obtain $T = O(\frac{n}{s}\log s)$, and the performance of Type-1 networks is summarized as follows:

    Type-1 mixed inverter:   $AT^2 = O(n^4 \log^2 s)$   for const. $\le s \le n/\log n$


The second kind of mixed networks (Type-2 mixed inverter) is constructed as follows:

(1) D-modules are type-1 mixed inverters (for s x s matrices). According to the preceding discussion, for any value of a parameter $r \le s/\log s$, these modules have height and width both O(sr) and computation time $O(\log^2 r + \frac{s}{r}\log r)$. Choosing r = s/log s we obtain height $O(s^2/\log s)$, width $O(s^2/\log s)$, and time $O(\log^2 s)$.

(2) M-modules are pipelined matrix multipliers, as introduced by Preparata and Vuillemin [5]. It is shown in [5] that one such multiplier can be designed with height and width both $O(s^2/\log s)$ and computation time O(log s).

Again, the dimensions of both D-modules and M-modules are $O(s^2/\log s)$, whence

\[ A = O\Big(\frac{n^2 s^2}{\log^2 s}\Big). \]

With respect to computation time, we obtain the same conclusions as for type-1 mixed inverters, i.e.,

\[ T = O\Big(\log^2 s + \frac{n}{s}\log s\Big). \]

Therefore we obtain

\[ AT^2 = O\Big(\frac{n^2 s^2}{\log^2 s}\Big(\log^2 s + \frac{n}{s}\log s\Big)^2\Big) = O\Big(\frac{n^2 s^2}{\log^2 s}\cdot\frac{n^2}{s^2}\log^2 s\,\Big(1 + \frac{s\log s}{n}\Big)^2\Big) = O\Big(n^4\Big(1 + \frac{s\log s}{n}\Big)^2\Big). \]
Obviously, if $s \le n/\log n$ we have $s \log s \le n$, whence $AT^2 = O(n^4)$, and the performance of the Type-2 mixed inverter is so summarized:

    Type-2 mixed inverter:   $AT^2 = O(n^4)$   for const. $\le s \le n/(\log n)^2$

Since as s varies from a small constant value to $n/(\log n)^2$ the computation time T varies from O(n) to $O(\log^2 n)$, we can design networks meeting the $AT^2 = O(n^4)$ optimal bound for all T such that $O(\log^2 n) \le T \le O(n)$. Incidentally, even in totally unrestricted models of computation, such as the shared-memory machine (see, for example, [10]), $O(\log^2 n)$ is the smallest known running time for inverting a triangular matrix.
VLSI COMPUTING STRUCTURES
FOR SOLVING LARGE-SCALE LINEAR SYSTEM OF EQUATIONS

Abstract -- Gaussian elimination for solving large-scale linear system of algebraic equations is realized with pipelined VLSI cellular arithmetic networks. VLSI arrays are proposed for L-U decomposition of a dense matrix with pivoting, for triangularizing a given linear system A · x = b for pipelined solution, for obtaining the inverse of a triangular matrix, and for matrix multiplication used in solving a family of linear systems. Modular network realizations of the proposed VLSI computing structures are presented emphasizing practical packaging constraints. Structural complexity, expandability, speed analysis, memory and I/O requirements of the proposed VLSI architectures are also discussed.


1. INTRODUCTION 


Finding fast, accurate, and cost-effective
methods to solve a large-scale
Linear System of Equations (LSE), in the
form $A \cdot x = b$, has long been in demand
among scientists and engineers.
Due to lengthy sequences of arithmetic
computations, most large LSEs are solved
on high-speed digital computers using
well-developed software packages such as
the ALGOL-60, FORTRAN, Extended ALGOL,
and PL/1 programs described in Forsythe
and Moler [4]. Two major difficulties
arise in solving LSEs on general-purpose
digital computers by software programs.
(a) The main memory is not large enough
to accommodate a very large system matrix
A. Hence, many time-consuming I/O
transfers are needed in addition to the
CPU computation time. (b) With fixed word
length in digital computers, rounding
errors in algebraic processes, if not
properly controlled, may cause serious
loss of accuracy, leading to unreliable
solutions.


In order to alleviate these problems
presented by software means, the use of
parallel computers (SIMD or MIMD machines)
for solving LSEs has been studied by
Csanky [3], Stone [20], Chen and Kuck [2],
Orcutt [15], Sameh and Brent [18], Sameh
and Kuck [19], and Kant and Kimura [10].
The rapid advent of Very Large-Scale
Integration (VLSI) technology has created
a new architectural horizon in implementing
parallel algorithms directly in
hardware [13,14,21]. This possibility has
created a new research front on VLSI
computing structures, as reported in
Rem and Mead [17], Kung and Leiserson
[11], Kung [12], Foster and Kung [6],
and Horowitz [5]. In particular, Kung
and Leiserson have proposed the systolic
arrays for L-U decomposition without
pivoting [11]. Practical issues on packaging
constraints, memory and I/O supports,
and modular implementations are still
open problems towards the eventual realization
of VLSI computing structures.


In this paper, a new class of VLSI
cellular arithmetic arrays is presented
for solving LSEs in a synchronous and
pipelined fashion. The proposed VLSI
architectures are structured differently
from systolic arrays, even though both use
similar building cells. Listed below are
numerical tasks to be realized with the
proposed VLSI computing structures.

(1). L-U Decomposition of a Matrix
with or without Pivoting.

(2). System Triangularization and
Pipelined Solution of LSEs.

(3). Matrix Inversion and Matrix
Multiplication.


The proposed VLSI arrays and networks
can be applied to any dense matrices that
are nonsingular. All the processing cells
are kept busy all the time. Higher accuracy
and system stability can be achieved
with maximum column pivoting. The modularity
of the proposed VLSI arrays offers
better expandability, maintenance and
application flexibility. Computational
procedures in Gaussian elimination with
and without pivoting are described in
Section 2. These are followed by various VLSI
structures and their operational
considerations. Finally, complexity,
expandability, speed, accuracy, memory and
I/O supports, and performance of the
VLSI arithmetic devices are studied.


2. NUMERICAL COMPUTATIONS FOR SOLVING 
LINEAR ALGEBRAIC SYSTEMS 


An LSE is characterized by a pair
(A, b), where $A = (a_{ij})$ is an n x n
matrix, $b = (b_1, b_2, \ldots, b_n)^T$ is a
column vector, and n is the order of the
LSE. The problem of solving an LSE of
order n is to find a vector
$x = (x_1, x_2, \ldots, x_n)^T$ which satisfies

$$A \cdot x = b \qquad (1)$$

The solution x is unique if and only if
A is nonsingular. We shall consider only
strongly nonsingular systems, in which all
the diagonal submatrices of A are nonsingular.

For each nonsingular matrix A
there exists an inverse matrix $A^{-1}$
of A such that $A^{-1} \cdot A = A \cdot A^{-1} = I$,
where I is the identity matrix. The solution
vector x can be obtained by left-multiplying
both sides of Eq.1 by $A^{-1}$:

$$x = A^{-1} \cdot b \qquad (2)$$

If $A^{-1}$ is known, it requires $n^2$ multiplications
and n(n-1) additions to compute
the n components of x. However, finding
the inverse $A^{-1}$ is quite complicated and
should be avoided if unnecessary [4].


Using the Gaussian elimination method,
one can systematically decompose A
into two triangular matrices L and U such
that

$$A = L \cdot U \qquad (3)$$

where $L = (l_{ij})$ is a lower triangular matrix
with all diagonal elements equal to
1, and $U = (u_{ij})$ is an upper triangular
matrix with nonzero diagonal elements.
Such an L-U decomposition is unique if
and only if A is strongly nonsingular. We
shall show the calculations of elements
$l_{ij}$ and $u_{ij}$ along with the proposed VLSI
arrays.


The sequence of Gaussian elimination
operations transforms the dense system
$A \cdot x = b$ into an equivalent triangular
LSE characterized by

$$U \cdot x = L^{-1} \cdot b \qquad (4)$$

With this triangularized system, one can
compute the solution vector x by

$$x = U^{-1} \cdot (L^{-1} \cdot b) = U^{-1} \cdot d \qquad (5)$$

The inverse matrices $U^{-1}$ and $L^{-1}$ always
exist, because U and L are nonsingular.
We have expressed $L^{-1} \cdot b = d$. Equation 5
can also be obtained from Eq.2 by the fact
$A^{-1} = (L \cdot U)^{-1} = U^{-1} \cdot L^{-1}$.

The inversion of a triangular matrix
requires much less computation than that
of an arbitrary dense matrix. We shall
show how to compute $d = L^{-1} \cdot b$ and
$x = U^{-1} \cdot d$ directly with VLSI hardware.
The Gaussian elimination procedure automatically
produces the new coefficient vector d
without an explicit evaluation of $L^{-1}$. In
fact, if $U = (u_{ij})$ is known, the solution
given in Eq.5 involves the following
recursive computations.

$$x_n = \frac{d_n}{u_{nn}}$$

$$x_i = \frac{1}{u_{ii}}\left(d_i - \sum_{j=i+1}^{n} u_{ij}\, x_j\right) \quad \text{for } i = n-1, n-2, \ldots, 2, 1 \qquad (6)$$
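
The recursion of Eq.6 can be sketched in software as follows. This is only a serial illustration of the arithmetic, not the pipelined VLSI array; the function name `back_substitute` and the list-of-rows matrix layout are assumptions made for the example.

```python
def back_substitute(U, d):
    """Solve U x = d for an upper triangular U (list of rows), following Eq.6."""
    n = len(d)
    x = [0.0] * n
    # Work upward from the last unknown x_n to x_1.
    for i in range(n - 1, -1, -1):
        # Sum of u_ij * x_j over the already computed unknowns j > i.
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (d[i] - s) / U[i][i]
    return x
```

For instance, with U = [[2, 1], [0, 4]] and d = [4, 8], the last unknown is 8/4 = 2 and the first is (4 - 2)/2 = 1.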


This combined computation of elements
$(u_{ij})$ and $(d_i)$ can be done directly
by the same VLSI array structure.
Gaussian elimination without pivoting (natural
ordering of elimination) requires
$n(n^2-1)/3$ operations to yield a triangular
LSE. An operation here implies a
multiplication-addition pair. It takes $n(n+1)/2$
operations to solve one triangular LSE
using the above recursions. Frequently,
one needs to compute a family of LSEs
characterized by the same A matrix with
different coefficient vectors
$b_k = (b_{1k}, b_{2k}, \ldots, b_{nk})^T$, for k=1,2,...,m.
The family of LSEs $A \cdot x_k = b_k$ for
k=1,2,...,m requires repeating similar
computations m times. We propose to solve such
a family of LSEs with a pipelined VLSI
multiplication array in 2n operation cycles.
In contrast, solving m LSEs of order n on a
uniprocessor system requires performing
$mn^2 + (n^3-n)/3$ operations sequentially.
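
The operation counts quoted above are easy to tabulate. The sketch below simply evaluates the three formulas from the text; the function names are hypothetical and one "operation" is a multiplication-addition pair, as defined above.

```python
def elimination_ops(n):
    """Operations to triangularize an order-n LSE without pivoting: n(n^2 - 1)/3."""
    return n * (n * n - 1) // 3  # exact: n(n-1)(n+1) is divisible by 3

def triangular_solve_ops(n):
    """Operations to solve one triangular LSE of order n: n(n + 1)/2."""
    return n * (n + 1) // 2

def serial_family_ops(n, m):
    """Operations to solve m LSEs sharing one matrix A on a uniprocessor."""
    return m * n * n + (n ** 3 - n) // 3
```

For n = 4 these give 20 elimination operations and 10 operations per triangular solve; a family of m = 3 systems costs 3 x 16 + 20 = 68 operations serially.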


3. LU DECOMPOSITION OF A DENSE MATRIX 


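Two VLSI processor arrays are proposed
below to realize the Gaussian elimination
method for L-U decomposition; one
corresponds to Gaussian elimination with
natural ordering (no pivoting), and the
other corresponds to maximal column
pivoting. For clarity, the decomposition
procedure is presented by an example
LSE of order n=4.

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} \qquad (7)$$

The two triangular matrices, L and U,
assume the following forms.

$$L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ l_{21} & 1 & 0 & 0 \\ l_{31} & l_{32} & 1 & 0 \\ l_{41} & l_{42} & l_{43} & 1 \end{pmatrix} \qquad (8.a)$$

$$U = \begin{pmatrix} u_{11} & u_{12} & u_{13} & u_{14} \\ 0 & u_{22} & u_{23} & u_{24} \\ 0 & 0 & u_{33} & u_{34} \\ 0 & 0 & 0 & u_{44} \end{pmatrix} \qquad (8.b)$$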

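The following recursions are embedded
in Gaussian elimination procedures without
pivoting.

$$a'_{ij} = a_{ij} - (a_{i1}/a_{11}) \times a_{1j} \quad \text{for } i,j = 2,3,4$$

$$a''_{ij} = a'_{ij} - (a'_{i2}/a'_{22}) \times a'_{2j} \quad \text{for } i,j = 3,4 \qquad (9)$$

$$a'''_{ij} = a''_{ij} - (a''_{i3}/a''_{33}) \times a''_{3j} \quad \text{for } i,j = 4$$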


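The entries of L and U matrices are
obtained in terms of these three sets of
recursively generated coefficients.

$$l_{i1} = a_{i1}/a_{11} \quad \text{for } i = 2,3,4$$

$$l_{i2} = a'_{i2}/a'_{22} \quad \text{for } i = 3,4 \qquad (10.a)$$

$$l_{i3} = a''_{i3}/a''_{33} \quad \text{for } i = 4$$

$$u_{1j} = a_{1j} \quad \text{for } j = 1,2,3,4$$

$$u_{2j} = a'_{2j} \quad \text{for } j = 2,3,4$$

$$u_{3j} = a''_{3j} \quad \text{for } j = 3,4 \qquad (10.b)$$

$$u_{4j} = a'''_{4j} \quad \text{for } j = 4$$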
The above recursions require repeatedly
performing multiply, divide and subtract
operations in n=4 iterative steps.
Two types of arithmetic cells as shown in
Fig.1 are required to perform these basic
computations. One is the M cell for additive
multiplication, specified by the arithmetic
equations $d = c + a \times b$, $a' = a$ and
$b' = b$. The other type is called the D cell, for
division specified by $g = e/f$ and $f' = f$.
Note that a small circle at the input or
output terminal of an arithmetic cell
means an arithmetic negation, such as a
two's complement operation. No registers
are assumed within each cell. Instead, we
use latch registers between segments of
arithmetic cells.
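
As a software illustration of Eqs. 9 and 10, the sketch below performs the L-U decomposition without pivoting using only the two cell operations just described. It is a serial sketch under stated assumptions, not a model of the pipelined array: the names `m_cell`, `d_cell` and `lu_no_pivot` are hypothetical, and the negation of the multiplier stands in for the small-circle negation at a cell terminal.

```python
def m_cell(c, a, b):
    """Additive multiplication performed by an M cell: d = c + a * b."""
    return c + a * b

def d_cell(e, f):
    """Division performed by a D cell: g = e / f."""
    return e / f

def lu_no_pivot(A):
    """Return (L, U) with A = L * U, following the recursions of Eqs. 9 and 10."""
    n = len(A)
    a = [[float(v) for v in row] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):                       # one elimination step per column
        for i in range(k + 1, n):
            L[i][k] = d_cell(a[i][k], a[k][k])   # l_ik = a_ik / a_kk (Eq.10.a)
            for j in range(k + 1, n):
                # a_ij <- a_ij - l_ik * a_kj (Eq.9), with the multiplier negated
                a[i][j] = m_cell(a[i][j], -L[i][k], a[k][j])
            a[i][k] = 0.0
    # The rows that survive elimination are the rows of U (Eq.10.b).
    U = [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U
```

The sketch assumes a strongly nonsingular A, so that no pivot a_kk is zero.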


A VLSI array for L-U decomposition
without pivoting is shown in Fig.1. For a
general LSE of order n, $(n-1)^2$ M cells and
n-1 D cells are needed in the cellular array
construction. In order to facilitate
synchronous pipelined operations, fast
interface latches are used between segments
(rows) of arithmetic cells.


Input operands are from the elements
$(a_{ij})$ of matrix A, feeding in one column
per cycle as shown. The VLSI pipeline
has two-way traffic flows. Data streams
flow first upward. After reaching the top
segment of division cells, data streams
then flow downward. Due to this two-way
pipelining, a number of dummy numeric zeros
and ones are interleaved with the matrix
elements $(a_{ij})$ in the input data
streams. The array outputs are elements
$(u_{ij})$, $(l_{ij})$ of matrices U and L, also with
some dummy interspaced zeros. The proper
timing of segment delays is marked at the
side for each cycle of the pipeline. In
general, 3n-2 pipeline cycles are needed
to generate all the elements of the L and U
matrices. Between successive applications
of the two-way pipeline, start-up delays
of n-1 cycles are needed to drain the pipeline,
as shown by the first three cycles
($t_1$ through $t_3$) in Fig.1.


Gaussian elimination with natural
order may result in a serious loss of
accuracy. For example, the division
of a product term by a very small number
(Eqs. 9 and 10) may cause overflow beyond
the precision limit of the machine. Therefore,
we wish to choose the maximal divisors,
called pivot elements, in the elimination
process. The maximal pivot, selected
among all remaining rows and columns
of the matrix being triangularized,
will cause the least loss of accuracy. We
implement maximal column pivoting,
which searches for the maximal element only
within each remaining column. This is
especially convenient, because the elements
of A are fed into the pipeline by
column as demonstrated in Fig.1.


A VLSI array is proposed in Fig.2 for
L-U decomposition with maximal column
pivoting. This array is modified from the
array in Fig.1 by adding pivot
selection logic. The Pivot Indicator (PI)
is a logic device which indicates the maximal
element among a column of matrix elements.
The outputs of PI are Boolean values, "1"
signaling the location of the maximal column
pivot and "0" for the remaining elements.
The successive outputs of PI for j = 1,2,
3,4 are labeled by $I_{ij}$ for segments i =
1,2,3,4. The Pivot Exchange (PE) unit has
8 inputs, four of which are the column
elements and the other four are the
corresponding Boolean indicators $(I_{ij})$ from
the outputs of the PI shown in the
drawing. The PE will interchange at its
output the indicated pivot element with
the leftmost column element. The non-pivot
input elements are passed to the
corresponding output lines unchanged. In other
words, the PE will always output the pivot
element at its leftmost output line. When
the original leftmost input is itself the
indicated pivot, no exchange will be made
and the PE will simply pass all its inputs
unchanged to the corresponding outputs.
Details of the pivoting logic can be found
in Ref. [9]. The input/output of the array
in Fig.2 is labeled in Table 1.
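
The behavior of the two pivoting devices can be sketched functionally as below. This is a hedged illustration only: the function names are hypothetical, the logic-level design is in Ref. [9], and "maximal" is taken here to mean largest in magnitude, which is an assumption not spelled out in the text.

```python
def pivot_indicator(column):
    """Return Boolean flags (0/1) with a single 1 at the maximal column element.

    Assumption: "maximal" means largest absolute value; ties go to the first one.
    """
    best = max(range(len(column)), key=lambda i: abs(column[i]))
    return [1 if i == best else 0 for i in range(len(column))]

def pivot_exchange(column, flags):
    """Swap the flagged pivot with the leftmost element; pass the rest unchanged."""
    out = list(column)
    p = flags.index(1)
    out[0], out[p] = out[p], out[0]   # a no-op when the leftmost input is the pivot
    return out
```

For the column (1, -7, 3, 2) the indicator outputs (0, 1, 0, 0) and the exchange unit delivers (-7, 1, 3, 2), with the pivot on the leftmost output line.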


4. SYSTEM TRIANGULARIZATION
AND SOLUTION PIPELINE


Substituting Eq.3 into Eq.1, we obtain
$L \cdot (U \cdot x) = b$. This actually
represents two triangular systems interlocking
with each other. The forward elimination
corresponds to the lower triangular system

$$L \cdot d = b \qquad (11.a)$$

and the backward substitution corresponds
to the upper triangular system

$$U \cdot x = d \qquad (11.b)$$

The solutions of these two triangular systems
will lead to the final solution vector
x. In this section, we wish not to compute
the inverses $L^{-1}$ and $U^{-1}$ to obtain the
solution vector x.


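Shown below is a VLSI array for
triangularizing A into U and at the same time
obtaining the new coefficient vector d and
L. Let us rewrite Eq.11.a for the example
LSE of order 4.

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ l_{21} & 1 & 0 & 0 \\ l_{31} & l_{32} & 1 & 0 \\ l_{41} & l_{42} & l_{43} & 1 \end{pmatrix} \cdot \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} \qquad (12)$$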

We relabel the column vector
$(b_1, b_2, b_3, b_4)^T = (a_{15}, a_{25}, a_{35}, a_{45})^T$. Using a
formulation similar to Eq.9, we can extend
the column index to j=5 to obtain the following
recursions with six additional
terms.

$$a'_{ij} = a_{ij} - (a_{i1}/a_{11}) \times a_{1j} \quad \text{for } i = 2,3,4 \text{ and } j = 2,3,4,5$$

$$a''_{ij} = a'_{ij} - (a'_{i2}/a'_{22}) \times a'_{2j} \quad \text{for } i = 3,4 \text{ and } j = 3,4,5 \qquad (13)$$

$$a'''_{ij} = a''_{ij} - (a''_{i3}/a''_{33}) \times a''_{3j} \quad \text{for } i = 4 \text{ and } j = 4,5$$


The solutions of the forward triangular
system (Eq.12) can be recursively
computed as below using Eq.10.a and Eq.13.

$$d_1 = b_1 = a_{15}$$

$$d_2 = b_2 - l_{21} \cdot b_1 = a'_{25} \qquad (14)$$

$$d_3 = (b_3 - l_{31} \cdot b_1) - l_{32} \cdot (b_2 - l_{21} \cdot b_1)$$


Note the similarity between the
expressions in Eq.14 and those for $(u_{ij})$ in
Eq.10.b. We can use an extended VLSI
structure to compute Eq.14, modified from
the previous L-U decomposition arrays
(Fig.1). This array, as detailed in Ref.
[9], can simultaneously generate the elements
of matrix U and of vector d without
explicitly computing the elements of the
inverse matrix $L^{-1}$. Such an array can
directly convert the original LSE into an
equivalent triangular system. The elements
of matrix A and of vector b serve
as the inputs, and elements of U and vector
d are the outputs.
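
The idea of treating b as a fifth column of A (Eq.13) can be sketched serially as follows. This is an illustration of the arithmetic only, with a hypothetical function name `triangularize`; the actual extended array is described in Ref. [9].

```python
def triangularize(A, b):
    """Reduce A x = b to U x = d by elimination on the augmented matrix [A | b]."""
    n = len(A)
    # Append b as column n (the column j = n + 1 of Eq.13).
    a = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            l_ik = a[i][k] / a[k][k]
            for j in range(k + 1, n + 1):   # includes the appended b column
                a[i][j] -= l_ik * a[k][j]
            a[i][k] = 0.0
    U = [row[:n] for row in a]
    d = [row[n] for row in a]               # d = L^{-1} b, without forming L^{-1}
    return U, d
```

With A = [[2, 1], [4, 5]] and b = [3, 9], the sketch yields U = [[2, 1], [0, 3]] and d = [3, 3], which agrees with d_2 = b_2 - l_21 b_1 = 9 - 2 x 3.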


The solution of the upper triangular
system (Eq.11.b) has already been recursively
specified in Eq.6 in terms of $(d_i)$ and
$(u_{ij})$. The VLSI array shown in Fig.3 is
specially designed to generate the solution
vector x of the LSE. In order to
connect the outputs of the triangularizing
array directly to the inputs of
this LSE solver, some precautions must be
taken to provide the necessary interface
delays to match the speed of the two data
streams.

5. MATRIX INVERSION AND MULTIPLICATION 


In applications where repeated solutions
of $A \cdot x_k = b_k$ are needed over a
set of coefficient vectors $b_k$ for k=1,2,
...,m, the use of the inverse matrix $A^{-1}$
may become very attractive to generate
the set of solution vectors via the sequence
of computations $x_k = A^{-1} \cdot b_k$ for
k=1,2,...,m. In this sequence, the inverse
$A^{-1}$ need be computed only once. According
to Eq.5, one can reduce the problem (after
Gaussian elimination) to the following:

$$x_k = A^{-1} \cdot b_k = (L \cdot U)^{-1} \cdot b_k = (U^{-1} \cdot L^{-1}) \cdot b_k \quad \text{for } k = 1,2,\ldots,m \qquad (15)$$

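We present below VLSI pipelines for
finding the inverse matrix $L^{-1} = (m_{ij})$
from $L = (l_{ij})$ and the inverse matrix $U^{-1} =
(v_{ij})$ from $U = (u_{ij})$. Based on the
triangular forms of L and U specified in Eq.
8, $L^{-1}$ and $U^{-1}$ will also be triangular
matrices.

$$L^{-1} = (m_{ij}) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ m_{21} & 1 & 0 & 0 \\ m_{31} & m_{32} & 1 & 0 \\ m_{41} & m_{42} & m_{43} & 1 \end{pmatrix} \qquad (16.a)$$

$$U^{-1} = (v_{ij}) = \begin{pmatrix} v_{11} & v_{12} & v_{13} & v_{14} \\ 0 & v_{22} & v_{23} & v_{24} \\ 0 & 0 & v_{33} & v_{34} \\ 0 & 0 & 0 & v_{44} \end{pmatrix} \qquad (16.b)$$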

Matrix multiplication is performed
to obtain the inverse matrix $A^{-1} = U^{-1} \cdot L^{-1}$
from $U^{-1}$ and $L^{-1}$. Let $e_k$ be the unit
column vector whose components are all zero
except the k-th component, which is one.
The columns of $U^{-1}$ are simply the respective
solutions of the following n LSEs.

$$U \cdot y_k = e_k \quad \text{for } k = 1,2,\ldots,n \qquad (17)$$

One can write the n column vectors
$y_k$ and $e_k$ into a matrix form as $Y =
(y_1, y_2, \ldots, y_n)$ and $I = (e_1, e_2, \ldots, e_n)$,
where I is the identity matrix. Then the
family of LSEs in Eq.17 can be rewritten
into the compact form $U \cdot Y = I$. This
implies that $Y = U^{-1}$. The following recursive
formula is used to compute the elements
$(v_{ij})$ of the inverse $U^{-1}$ from a
given matrix $U = (u_{ij})$.

$$v_{kk} = \frac{1}{u_{kk}} \quad \text{for } k = 1,2,\ldots,n$$

$$v_{ij} = -\left(\sum_{k=i+1}^{j} u_{ik} \times v_{kj}\right) / u_{ii} \quad \text{for all } j > i \qquad (18)$$

For the LSE of order n=4, we have

$$v_{kk} = 1/u_{kk} \quad \text{for } k = 1,2,3,4$$

$$v_{12} = -(u_{12} \times v_{22})/u_{11}$$

$$v_{13} = -(u_{12} \times v_{23} + u_{13} \times v_{33})/u_{11}$$

$$v_{14} = -(u_{12} \times v_{24} + u_{13} \times v_{34} + u_{14} \times v_{44})/u_{11} \qquad (19)$$

$$v_{23} = -(u_{23} \times v_{33})/u_{22}$$

$$v_{24} = -(u_{23} \times v_{34} + u_{24} \times v_{44})/u_{22}$$

$$v_{34} = -(u_{34} \times v_{44})/u_{33}$$


The VLSI array for finding the inverse
matrix $L^{-1}$ is shown in Fig.4. A similar
array can be obtained for finding $U^{-1}$ from
U as detailed in [9].
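
The recursion of Eq.18 for the inverse of an upper triangular matrix can be sketched in a few lines. This is a serial illustration with a hypothetical function name `invert_upper`, not a description of the array in Ref. [9]; it computes each column of the inverse from the diagonal upward, so every v_kj it needs is already available.

```python
def invert_upper(U):
    """Return V = U^{-1} for an upper triangular U (list of rows), following Eq.18."""
    n = len(U)
    V = [[0.0] * n for _ in range(n)]
    for k in range(n):
        V[k][k] = 1.0 / U[k][k]                 # v_kk = 1 / u_kk
    for j in range(n):
        for i in range(j - 1, -1, -1):          # entries above the diagonal
            s = sum(U[i][k] * V[k][j] for k in range(i + 1, j + 1))
            V[i][j] = -s / U[i][i]              # v_ij = -(sum u_ik * v_kj) / u_ii
    return V
```

For U = [[2, 1], [0, 4]] the sketch gives v_11 = 0.5, v_22 = 0.25 and v_12 = -(1 x 0.25)/2 = -0.125.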


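We present next a VLSI pipelined array
of M cells for the multiplication of
two arbitrary dense matrices. Obviously,
this array can be used to compute the
inverse matrix $A^{-1}$ by performing the
multiplication $U^{-1} \cdot L^{-1}$. The array structure
is depicted by the multiplication of two
3 x 3 square matrices,

$$A \times B = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \times \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{pmatrix} = C \qquad (20)$$

where the product coefficients
$c_{ij} = \sum_{k=1}^{3} a_{ik} \cdot b_{kj}$ for all i and j.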


The rectangular array design is shown
in Fig.5. The elements of matrices A and B
are fed from the lower and upper input
lines in a pipelined fashion, one skewed
row or one skewed column at a time. Some
dummy zero inputs are interspaced with the
matrix elements. In general, multiplying
two n x n matrices requires n(2n - 1) multiply
cells (M cells). The start-up delay
corresponds to the longest path on this array,
which equals 2n-1 clock periods. This
array differs from the systolic array [11]
in both interconnection structure and the
way inputs are applied and outputs are
retrieved. If one counts the start-up delays,
the time required to produce the last product
term $c_{nn}$ ($c_{33}$ in Fig.5) equals 4n-2
clock periods.
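
The product of Eq.20 and the cost figures quoted above can be sketched as follows. The function names are hypothetical; `matmul` accumulates each c_ij by repeated multiply-add steps of the kind an M cell performs (d = c + a x b), and `multiply_array_cost` merely evaluates the stated formulas for an n x n product. Neither models the skewed data flow of Fig.5.

```python
def matmul(A, B):
    """Dense matrix product C = A * B built from multiply-add steps."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c = 0
            for k in range(inner):
                c = c + A[i][k] * B[k][j]   # one M-cell operation
            C[i][j] = c
    return C

def multiply_array_cost(n):
    """Cell count and timing of the rectangular array, as stated in the text."""
    return {
        "m_cells": n * (2 * n - 1),    # number of M cells
        "startup": 2 * n - 1,          # start-up delay in clock periods
        "last_product": 4 * n - 2,     # clock periods until c_nn appears
    }
```

For the 3 x 3 example of Fig.5 the formulas give 15 M cells, a start-up delay of 5 clock periods, and 10 clock periods until the last product term.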


For triangular matrices, such as $U^{-1}$
and $L^{-1}$ specified in Eq.16, the full
multiplication array must be used. This is
due to the fact that their product matrix
$A^{-1} = U^{-1} \cdot L^{-1}$ is, in general, an
arbitrary dense matrix.


The collection of m column-vector
solutions $x_k = (x_{1k}, x_{2k}, \ldots, x_{nk})^T$ for
k=1,2,...,m specified in Eq.15 can be
generated in pipelined fashion by carrying
out the following matrix multiplication.

$$X = A^{-1} \cdot B \qquad (21)$$

where

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix} \qquad (22)$$

$$B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nm} \end{pmatrix} \qquad (23)$$

and $A^{-1} = (c_{ij})$ is the inverse matrix of A
generated by the multiplication
array of Fig.5. When m = n, one can simply
reuse the array of Fig.5 to compute the
solution matrix X. When m > n, the
multiplication array must be expanded in one
of the two dimensions, say by adding more
rows.


Since the elements $(c_{ij})$ of the
inverse matrix $A^{-1}$ will be repeatedly used,
we have devised a special VLSI array for
carrying out the multiplication specified
in Eq.21. This alternate array, reported
in [9], is singly pipelined only in
the vertical direction. The $(c_{ij})$ entries
are fed through a fan-in demultiplexer and
distributed to all M cells. The column
elements of the matrix B are fed through
the vertical inputs, one skewed column at
a time via a fan-in multiplexer. After the
first solution $x_1 = (x_{11}, x_{21}, \ldots, x_{n1})^T$
appears at the output end, one solution
vector will appear at each additional
cycle. The attractive part of this array
is that it is applicable to any number
m of LSEs in a pipelined fashion.


6. MODULAR NETWORKING OF VLSI 
COMPUTING STRUCTURES 


VLSI devices must grow gradually. Their
growth is constrained chiefly by chip density,
packaging area, and pin limitations.
Building a "very large" LSE solver, say of

order 10° or greater, on a monolithic Chip 
depends on how these constraints can be
overcome. Extensive development efforts
are still needed to develop VLSI modules,
which can be interconnected to form a
network LSE solver. We propose below two
concrete examples of such a networking
approach. The first example shows the
modularization of rectangular VLSI computing
arrays and the second one that of triangular
VLSI arrays.


A general LSE of order n requires
4n-2 I/O ports in Fig.1, each of which has
a width of w bits (equal to the operand
length). For large n and w, this implies
an exceedingly large number of external
leads on the VLSI chip. Obviously, the
projected IC packaging technology renders
such VLSI arrays unrealistic. To alleviate
this problem of constrained I/O leads, a
fan-in demultiplexing and fan-out multiplexing
scheme is depicted in Fig.6 for a
general VLSI computing structure with k x m
parallel inputs and r x s parallel outputs.
After the I/O multiplexing, reasonably
small numbers of k inputs and r outputs
are allowed at the I/O ends. In order to
ensure proper serial-to-parallel and
parallel-to-serial conversions as demonstrated,
at least two clocks, $C_1$ and $C_2$, are needed
for each VLSI device. The clock $C_1$ is used
to control data in and out of the input and
output registers respectively. The array
clock $C_2$ has a period equal to km or rs
times the period p of clock $C_1$. The
multiplicity reflects the degree of multiway
conversion logic used at the I/O ends.
$C_2$ is the array clock controlling all the
latches in the VLSI array. The timing
relationships of the two clock signals are
demonstrated in the lower half of Fig.6. The
actual numbers, k and r, of inputs and outputs
are determined by the chip packaging
requirement.


The rectangular VLSI array in Fig.1
can be partitioned into two types of VLSI
modules as shown in Fig.7. Using these two
module types, one can construct L-U
decomposition networks of arbitrarily high
order. The multiply module, M(q x q),
corresponds to a q-by-q subarray of multiply
cells in Fig.1. The division module, D(q),
corresponds to the top row of division
cells in Fig.1. Multiple division
modules can be fabricated on the same chip,
say q D(q) modules on a chip, which would
be comparable in complexity with one M(q
x q) module.


In Fig.7, four M(q x q) modules and
two D(q) modules are used to construct an
L-U decomposition network LU(2q+1) for
an LSE of order n = 2q+1. In general, an
LU(n) network of order $n = p \cdot q + 1$ requires
$p^2$ M(q x q) modules and p D(q) modules. One
additional demultiplexer/multiplexer pair
is needed between the network modules and
the external memory system, where the
operands and results are supposed to reside.


The partition of triangular VLSI arrays
into a network of VLSI modules is
exemplified by a modular matrix inversion
scheme. The interconnection of three
triangular multiply modules T(q) and three
square multiply modules S(q x q) shown in
Fig.8 produces a matrix inversion network,
MI(3q), for computing the inverse matrix
$L^{-1} = (m_{ij})$ of a lower triangular matrix
$L = (l_{ij})$ of order n = 3q.

In general, a matrix inversion network
MI(n) of order $n = p \cdot q$ requires p
T(q) modules and $p \cdot (p-1)/2$ S(q x q) modules.
Detailed designs of these VLSI modules
can be found in [9].
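
The module requirements stated above for the two kinds of network can be tabulated with a short sketch. The function names are hypothetical; p is the number of module-sized blocks along one dimension, so that n = pq + 1 for LU(n) and n = pq for MI(n).

```python
def lu_network_modules(p):
    """Modules for an L-U decomposition network LU(n) of order n = p*q + 1."""
    return {"M": p * p, "D": p}          # p^2 M(q x q) and p D(q) modules

def inversion_network_modules(p):
    """Modules for a matrix inversion network MI(n) of order n = p*q."""
    return {"T": p, "S": p * (p - 1) // 2}   # p T(q) and p(p-1)/2 S(q x q) modules
```

With p = 2 the first function reproduces the four M(q x q) and two D(q) modules of LU(2q+1) in Fig.7, and with p = 3 the second reproduces the three T(q) and three S(q x q) modules of MI(3q) in Fig.8.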


7. STRUCTURAL COMPLEXITY AND
PERFORMANCE ANALYSIS 


It has been predicted that by the 
late 80's it will be possible to fabricate 


IC chips, each of which contains 10! or 


10° individual transistors [14,17]. New
high-resolution lithographic techniques
have already demonstrated the feasibility
of achieving such VLSI devices with NMOS
technology. The VLSI computing structures
require not only a large number of processing
cells and latch memory, but also a large
number of conducting paths for communicating
information throughout the integrated
system. The length and organization of
these communication paths set a lower
bound on the chip area and time delay
required for system operations [17].
Furthermore, the I/O and packaging
constraints of monolithic VLSI chips have set
limitations on the applicability of VLSI
chips in digital system design.


The structural complexity of VLSI
computing structures is estimated at the
logical level in terms of the number of
processing cells used in a schematic array
layout or in terms of VLSI modules
used in a network construction. The potential
speed of a VLSI device is determined
by the total clock periods needed for a
specific computation sequence. We lump the
path delays into the cell delays. The
multiply cells (M cells) can each assume the
global cellular structure of carry-save
adders as in Hwang [7]. The division cells
(D cells) can assume the cellular structures
suggested by Cappa and Hamacher [1]
and also those described in Hwang [8]. We
firmly believe that the use of interface
latches instead of registers in cells will
better facilitate the control of pipelined
operations.


With the same word length, the M
cells and D cells should have about equal
time delay, say $\Delta$ time units per cell of
either type. It is now possible to achieve
24-bit-by-24-bit multiplication or division
with LSI bipolar cellular arrays in
less than 200 nanoseconds. The delay of
the pivoting logic per pipeline segment
in Fig.2 is denoted by $\delta$ time units.
The interface latch delay is negligible
when compared with $\Delta$ or $\delta$. Therefore, the
segment delay between two adjacent
latches equals $\Delta + \delta$ or $\Delta$, depending
on whether pivoting is used or not. This
means that the internal array clock ($C_2$
in Fig.2) of the pipeline may have a period
p equal to $\Delta$ or $\Delta + \delta$.


Consider an LSE of order n. The numbers
of arithmetic cells (either M or D
cells) required in each of the presented
VLSI arrays are summarized in Table 2. The
numbers of I/O terminals are also shown.
These I/O terminal counts refer to the
parallel inputs and parallel outputs to or
from the internal VLSI array before using
the fan-in and fan-out conversion interfaces
as demonstrated in Fig.6. The start-up
delays for draining the array pipelines
and the net compute time are expressed in
terms of parameters n, $\Delta$, and $\delta$. The sum
of the start-up delay and the compute time
equals the total compute time required to
complete the specified sequence of
computations. In all cases, the total compute
time of each of the proposed VLSI arithmetic
pipelines is linearly proportional to
the order n of the LSEs. This implies a
speedup from $O(n^2)$ or $O(n^3)$ operations
required in a serial computer to $O(n)$
steps using VLSI computing networks. For
large values of n, the speedup is rather
impressive.


The proposed VLSI arrays are expandable
to allow modular growth. The L-U
decomposition arrays and the system
triangularization array can each be expanded
by adding more rows of M cells at the bottom
and extending the lengths of all rows
to the right. Without pivoting logic, such
extension can be done by using modules as
demonstrated in Section 6. With pivoting,
the array must be expanded into the third
dimension in order to achieve modularization.
Pivoting will increase the accuracy
and stability of the solution to an LSE.
This is an improvement over the systolic
arrays. Any dense system with a "strongly"
nonsingular matrix A can be solved by the
proposed VLSI networks.


The modular requirements for constructing
VLSI computing networks demonstrate
a tradeoff between module sizes q
in M(q x q), D(q), and S(q x q), and the
network sizes n in LU(n) or MI(n). The
proper choice depends heavily on the VLSI
technology and packaging capability. We
have assumed a continuous supply of operands
either from the main memory or from
a cache memory. The operand supply rate
may be slower than the array processing
rate. For a small matrix A, this problem can
be solved by using a large cache memory.
However, a large data buffer may increase
the cost of the computer system significantly.
The I/O interface structure matches
the speed of VLSI devices and that of the
memories from which the matrix or vector
elements are retrieved.

8. CONCLUSIONS


We have proposed a complete set of
VLSI arithmetic array and network
architectures for implementing the Gaussian
elimination method to solve LSEs. The L-U
decomposition is realized in hardware with
and without pivoting. The triangularization
and solution of a dense linear system
are realized directly with hardware
arrays without explicitly finding the
inverse matrix $A^{-1}$. For solving a family of
LSEs characterized by the same matrix A,
we have proposed the matrix inversion and
multiplication arrays for generating the
sequence of solutions in a pipelined fashion.
Modular networking and efficient I/O
structures are also presented for VLSI
computing structures. Continued efforts
are being exerted on the development of
bit-slice VLSI computing structures.

Fig.8 ... Triangular Modules T(q) to Form a
Matrix Inversion Network MI(3q) of
order 3q

Fig.7 Interconnection of Four M(q x q)
Modules and Two D(q) Modules to Form
an L-U Decomposition Network LU(2q+1)
of order 2q+1

Summary

Several recent papers have proposed
parallel adaptations of the sequential
alpha-beta algorithm [1, 2, 3, 6]. The
present paper derives time and storage
requirements for one such adaptation [1].
Alpha-beta search is fundamental to 
artificial intelligence research as many 
game playing programs employ it [4]. A 
simulation of a multiprocessor is used to 
derive timing requirements in terms of 
nodes visited, nodes scored, and elapsed 
time. The simulation environment in- 
cludes hardware processors and software 
processes. Storage requirements for the 
algorithm are derived analytically. The 
tradeoff between time and storage "cost" 
in the algorithm is demonstrated. 


The basis for our parallel imple- 
mentation of the alpha-beta algorithm is 
the following: assuming that the tree to 
be searched is perfectly ordered, those 
nodes that must be scored are (concur- 
rently) visited first. The algorithm is 
designed to minimize the run time of the 
search and to perform as many cutoffs as 
possible, thereby minimizing the cost of 
the search (total number of operations). 


To achieve these goals a distinction
is made among the sons of a node. The
first son of a node is called the "left
son". The subtree containing the left
son is called the "left subtree" and the
process that searches this subtree is the
"left process". All other sons of a node
are called "right sons" and are contained
in "right subtrees" which are searched by
"right processes".


The left subtree of a node is 
searched by a left process (which is 
spawned by the parent node) until a final 
value for the left son is backed up to 
the parent node. To obtain this final 
value, the left son's process spawns 
processes (lefts and rights) to search 
all of the left son's subtrees. Con- 
currently, a single, temporary value is 
obtained for each of the right sons of 
the parent node. These values are then
compared to the final value of the left 
son and cutoffs are made where appro- 
priate. 
The temporary value for a right son
is obtained by the right son's process
spawning a process to search its left
subtree. This new process searches the
subtree, backs up a value to the parent's
right son, and then dies. If after a
cutoff check the right subtree search 
continues, then a process is generated to 
search the second subtree of the right 
son. This procedure continues until 
either the subtree is exhaustively 
searched or the search is cut off. 


It is clear that, by applying the 
above method, those nodes that must be 
examined by the alpha-beta algorithm will 
be visited first. This ensures that 
needless work is not done; a cutoff check 
is performed before processes are gene- 
rated to search subtrees that may be cut 
off. 


In a search with more processors 
than running processes it may be possible 
to minimize the runtime of the search by 
generating processes to search the sons 
of a right node concurrently using the 
idle processors. This brute force 
approach is not used since it conflicts 
with the other aim of our design, namely 
minimizing the cost of the search. The 
cost of any tree search consists mainly 
of the cost of updating the system in 
moving from parent to son and in the cost 
of evaluating or scoring a node. There- 
fore even though a processor (which could 
be doing concurrent work) is idle, the 
overall cost in operations is minimized 
by not searching subtrees which may not 
have to be searched. 


There are seven main components of 
the parallel alpha-beta algorithm: 
Initialize, Handle, Score, Generate, 
GenerateMoves, Apply, and Update. 


1) Initialize reads in the original 
board position (i.e., the configuration 
for the root node of the search tree) and 
the depth to which the tree will be 
searched. Handle is then invoked to 
create a process for the root. 


2) Handle is a recursively-defined
process. It searches a node in a game
tree by calling either Score (for a leaf)
or Generate (for a non-leaf) and then
calling Update.


3) Score returns an integer repre- 
senting the value of a given board con- 
figuration. 


4) Generate searches a subtree that 
is not a leaf. It calls GenerateMoves to 
produce a list of moves from the current 
position. If the root of the subtree is 
a left node, then Handle is invoked once 
for each son. The processes thus created 
run concurrently, and Generate waits 
until they all terminate. If the root of 
the subtree to be searched is a right 
node, then the sons are searched in 
sequence by calling Handle for one of 
them, waiting for it to complete, and 
performing a cutoff check before
searching the next son. Apply is used to
produce board configurations for sons.


5) GenerateMoves produces all of the 
legal moves from a board configuration. 


6) Apply produces the board configu- 
ration that results when a given move is 
made on a given board configuration. 


7) Update waits until the parent's 
score table is free and then copies the 
value derived as a score for the current 
subtree into the table, if applicable. 


Since we did not have a multipro- 
cessor available on which to implement 
our algorithm, the simulation language 
GASP IV [5] was used to simulate physical 
parallelism. As our model of computation 
we use an MIMD computer. The machine we 
intend has a number of asynchronous pro- 
cessors with a communication facility 
provided by common memory or communica- 
tion lines. A processor can initiate 
another processor, send a message to 
another processor, or wait for a message 
from another processor. Apart from these 
interactions, processors proceed indepen- 
dently. 


The simulated environment provides 
multiple software processes and multiple 
hardware processors. A process is created 
for each node that is searched. The 
number of processors is a parameter of the 
program. 


We experimented with the implemented
algorithm to study the effects of
parallelism on the cost of a tree search,
this cost being expressed in: 1) run time
of the tree search, 2) number of terminal
nodes scored, and 3) total number of
nonterminal and terminal nodes visited.


A uniform tree of a given depth and 
branch factor was generated and stored
prior to the search. The terminal nodes 
of this tree were assigned scores chosen 
from a particular probability distribu- 
tion. The principal continuation was 
sought and the three measures of cost 
recorded. Typical results of experiments 
are shown in Figs. 1 and 2. 


The curves in Fig. 1 show that the
run time decreases sharply with an
increasing number of processors doing the
search. As expected, the total number of
nodes visited also increases with an
increasing number of processors, as can
be seen in Fig. 2.


To analyse the storage requirements 
we first assume that an infinite number 
of processors is available to search the 
tree. During the first phase of the 
algorithm, knowledge about the behaviour 
of the sequential version is used to 
explore several paths concurrently and 
independently. During all the remaining 
phases several subtrees are searched in 
parallel, each subtree, however, being 
searched sequentially. 


Fig. 3 shows a uniform tree whose 
depth and branch factor are both equal to 
three. The paths explored in parallel 
during the first phase are indicated by 
heavy lines. Nodes explored during the 
first phase are called "primary" nodes. 
Formally,

1) the root is a primary left son,

2) a primary left son at ply k is
   the left son of a primary left or
   right son at ply k - 1, and

3) a primary right son at ply k is a
   right son of a primary left son
   at ply k - 1.


Following the first phase the tempo- 
rary score backed up at node 1 is compared 
with the ones at nodes i and j; if the 
former is smaller, then the subtrees of i 
and j need not be considered at all. 
Otherwise these two subtrees, shown cir- 
cled in Fig. 4, are searched in parallel 
(each sequentially) during the second 
phase. 


When these two subtrees have been 
fully searched the final score backed up
at node 1 is compared with the temporary 
score at node m for a cutoff. If the 
former is larger, the cutoff check is 
successful and the unexplored subtrees of 
m need not be considered. Otherwise, more 
subtrees, shown circled in Fig. 5, are 
searched in parallel (each sequentially) 
during the third phase and so on. 


At least one storage location is 
needed to hold the temporary score of each 
node being explored. When a node is
discarded from further consideration its
storage locations are reallocated to an- 
other unexplored node that the algorithm 
decides to examine. Therefore it is 
necessary to derive the maximum number of 
nodes simultaneously explored at any time 
during the search. This number is pre- 
cisely the number of primary nodes. 


To see this note that any tree 
searched sequentially during the subse- 
quent phases is rooted at a node that was 
primary, that is to say explored during 
the first phase. This subtree is iso- 
morphic to the leftmost subtree rooted at 
the same primary node. The leftmost sub- 
tree has at least as many primary nodes 
as a subtree searched in subsequent 
phases. Therefore the number of nodes 
searched in parallel during the second 
and later phases cannot exceed the number 


of primary nodes. Let
   l(k) = number of primary left sons at
          ply k, and
   r(k) = number of primary right sons at
          ply k.
In Fig. 3, l(3) = 5 and r(3) = 6. For a
uniform tree:

   $l(k) = l(k-1) + r(k-1), \quad k \ge 1$
   $r(k) = N \cdot l(k-1), \quad k \ge 1$
   $l(0) = 1$ and $r(0) = 0$,

where N stands for the branch factor minus
one. For a uniform tree of depth D, the
total number of primary nodes is therefore
given by

   $S = \sum_{k=0}^{D} \left[\, l(k) + r(k) \,\right]$

and the storage requirements of the
algorithm are of O(S).


It is clear that our assumption 
about the availability of an unlimited 
number of processors can be relaxed. The 
maximum number of processors the algorithm 
will ever need to search a uniform tree 
of depth D is 


   $P = l(D) + r(D)$.


In Fig. 3, P = 11.
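
The recurrence for l(k) and r(k) is easy to check numerically. The following Python sketch is an illustration only (the function name primary_counts and its arguments are hypothetical, not from the paper). It takes S to be the sum of l(k) + r(k) over plies 0 through D, which is our reading of the total, and reproduces the values quoted for Fig. 3.

```python
def primary_counts(depth, branch):
    """Return lists l, r where l[k] and r[k] count the primary
    left and right sons at ply k of a uniform tree."""
    n = branch - 1           # N in the text: branch factor minus one
    l, r = [1], [0]          # l(0) = 1, r(0) = 0
    for k in range(1, depth + 1):
        l.append(l[k - 1] + r[k - 1])   # l(k) = l(k-1) + r(k-1)
        r.append(n * l[k - 1])          # r(k) = N * l(k-1)
    return l, r


# The tree of Fig. 3: depth 3, branch factor 3.
l, r = primary_counts(depth=3, branch=3)
S = sum(a + b for a, b in zip(l, r))    # total number of primary nodes
P = l[-1] + r[-1]                       # primary nodes at the deepest ply
print("l =", l, "r =", r, "S =", S, "P =", P)
```

For deeper or wider trees the same loop shows how quickly P grows, which is the motivation for bounding the processor count in what follows.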

Even though P establishes an upper 
bound it is still a very large number of 
order $N^{D/2}$, as one should have expected.
In practice a small number of processors 
running in parallel is usually sufficient 
to achieve a substantial reduction in the 
running time of the sequential alpha-beta 
algorithm. In fact, it was observed that 
the run time cannot be decreased below a 
certain level no matter how many more 
processors are used in the search. The 
number of processors p* which first
achieves this minimum run time was
recognized as the "saturation point" of
the algorithm.


These remarks lead us to reconsider 
our definition of primary nodes. The 
actual number of primary nodes is in fact 
determined by the number of processors 
available. If p processors are used to 
search a uniform tree of branch factor 
N + 1, then the actual number of primary 
nodes at level k is 


   $\min\{\, l(k) + r(k),\; p \,\}$

and the total number of primary nodes for
a tree of depth D is

   $s(p) = \sum_{k=0}^{D} \min\{\, l(k) + r(k),\; p \,\}$.

Under these conditions the storage
requirements of the algorithm are O(s(p)).
Note that S = s(P) and that for $p \le N+1$, we
have

   $s(p) = 1 + pD$.
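
The closed form for small p can be confirmed by direct summation. The sketch below is a hedged illustration in Python (the names level_sizes and s are ours). It assumes the sum runs over plies 0 through D, and checks both s(P) = S and s(p) = 1 + pD for p ≤ N + 1 on the tree of Fig. 3.

```python
def level_sizes(depth, branch):
    """Number of primary nodes l(k) + r(k) at each ply k = 0..depth."""
    n = branch - 1
    l, r = 1, 0
    sizes = [l + r]
    for _ in range(depth):
        l, r = l + r, n * l      # both right-hand sides use the old l and r
        sizes.append(l + r)
    return sizes


def s(p, depth, branch):
    """Primary nodes actually explored when only p processors exist."""
    return sum(min(size, p) for size in level_sizes(depth, branch))


depth, branch = 3, 3
for p in range(1, branch + 1):          # p <= N + 1
    print(p, s(p, depth, branch), 1 + p * depth)
```

The two printed columns agree because every ply below the root holds at least N + 1 primary nodes, so each of the D lower plies contributes exactly p.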

Combining the experimental timing 
results with the analytical storage 
results and making a typical "time versus 
storage tradeoff" decision the optimum 
number of processors to be used can be 
determined. This is indicated by the 
graphs in Figs. 6 and 7. The curve in 
Fig. 6 is plotted empirically by varying 
the number of processors searching uni- 
form trees of depth D and branch factor 
N+ 1. The curve in Fig. 7 is obtained 
analytically using the expression for 
s(p). The two curves are used to deter- 
mine p+, the optimum number of processors 
matching the available resources. 
Abstract


We present a distributed algorithm
for implementing α-β search on a tree of
processors. Each processor is an in- 
dependent computer with its own memory 
and is connected by communication lines 
to each of its nearest neighbors. Meas- 
urements of the algorithm's performance 
on the Arachne distributed operating 
system are presented. A theoretical 
model is developed that predicts speedup 
with arbitrarily many processors. 


1. INTRODUCTION 


The α-β search algorithm is central
to most programs that play games like 
chess. It is now well-known [1] that an 
important component of the playing skill 


of such programs is the speed at which 
the search is conducted. For a given 
amount of computing time, a faster 
search allows the program to "see" 
farther into the future. In this paper
we present and analyze a parallel
adaptation of the α-β algorithm. This
adaptation, which we will call the
tree-splitting algorithm, speeds up the
search of a large tree of potential
continuations by dynamically assigning
subtree searches for parallel execution.

In section 2, we summarize the α-β
algorithm. Section 3 reviews a parallel
implementation of the α-β algorithm
suggested by Baudet [2]. Section 4
formally describes the tree-splitting
algorithm. Section 5 presents performance
measurements for this algorithm taken on
a network of microprocessors. Section 6
discusses some possible optimizations
and variations of the algorithm. Section
7 derives the obtainable speedup with k
processors, as k tends towards ∞.


2. THE ALPHA-BETA ALGORITHM 


Consider a board position from a
game like chess or checkers. All possible
sequences of moves from this position
may be represented by a tree of positions
called the lookahead tree. The
nodes of the tree represent positions;
the children of a node are moves from
that node. The root node of the tree
usually represents the current position.
Since lookahead trees for most games are
usually too large to be searched even by
computer, they are usually truncated at
a certain level. Since we will later be
referring to a tree of processors, we
reserve the following notation for nodes
of lookahead trees: A node is often
called a position. A node's child is
its successor, and its parent is its
predecessor. If each non-terminal node
has n successors, we say that the tree
has degree n. The level of a node or
subtree is its distance from the root.

The α-β algorithm is an optimization
of the minimax algorithm, which we
will review first. The two players are
called max and min; at the root node, it
is max's turn to move. The minimax
algorithm proceeds as follows: First,
each leaf of the lookahead tree is
assigned a static value that reflects that
position's desirability. (High values
are desirable to max. In a game like
chess, the main component of the value
is usually the material balance between
the two sides.)

The interior nodes of the lookahead
tree may be given minimax values
recursively: If it is max's turn to move at
node A, the value of A is the maximum of
A's successors' values. (If the game
were to proceed to node A, it would then
be max's turn to move. Max, being
rational, would choose the successor with
the maximum value, say M. Therefore,
the subtree rooted at A must have M as
its value, because M is the value of the
leaf node we would reach if the game
reached A.) Similarly, if it is min's
turn to move at a node, then the value
of that node is the minimum of these
values.

We will use a version of the
minimax procedure called negamax: When
it is max's turn to move at a terminal
node, the node is assigned the same
static value used in minimax. When it
is min's turn to move, the static value
assigned is the negative of what it
would be in the minimax case. The value
of a nonterminal node at any level is
defined to be the maximum of the
negatives of the values of its successors.

The negamax algorithm can be cast 
into an ad hoc Pascal-like language. 
The following program is adapted from 
Knuth [3]: 


function negamax(p : position) : integer;
var m : integer;
    i,d : 1..MAXCHILD;
    succ : array[1..MAXCHILD] of position;
begin
  determine the successor positions
    succ[1],...,succ[d];
  if d = 0 then { terminal node }
    negamax := staticvalue(p)
  else
  begin { find maximum of child values }
    m := -∞;
    for i := 1 to d do
      m := max(m, -negamax(succ[i]));
    negamax := m;
  end
end.


The α-β algorithm evaluates the
lookahead tree without pursuing
irrelevant branches. Suppose we are
investigating the successors in a game of
chess, and the first move we look at is
a bishop move. After analyzing it, we
decide that it will gain us a pawn.
Next we consider a queen move. In
considering our opponent's replies to the
queen move, we discover one that can
irrefutably capture the queen; she has
moved to a dangerous spot. We need not
investigate our opponent's remaining
replies; in light of the worth of the
bishop move, the queen move is already
discredited.

The α-β search algorithm [3]
formalizes this notion:

function alphabeta(p : position;
                   α,β : integer) : integer;
label DONE;
var i,d : 1..MAXCHILD;
    succ : array[1..MAXCHILD] of position;
begin
  determine the successor positions
    succ[1],...,succ[d];
  if d = 0 then
    alphabeta := staticvalue(p)
  else
  begin
    for i := 1 to d do
    begin
      α := max(α, -alphabeta(succ[i],
                             -β, -α));
      if α ≥ β then goto DONE { cutoff }
    end;
    DONE: alphabeta := α;
  end
end.
The function alphabeta obeys the
accuracy property: For a given position
p, and for values of α and β such that
α < β,

   if negamax(p) ≤ α,
      then alphabeta(p,α,β) ≤ α
   if negamax(p) ≥ β,
      then alphabeta(p,α,β) ≥ β
   if α < negamax(p) < β, then
      alphabeta(p,α,β) = negamax(p)

The first and second cases above
are called failing low and failing high
respectively. In the third case,
success, alphabeta accurately reports the
negamax value of the tree. Success is
assured if α = -∞ and β = +∞. The pair
(α,β) is called the window for the
search.

To return to our example: When
alphabeta is called with p representing
the queen move, it is min's move. β is
the cutoff value generated by the bishop
move. The better the bishop move was
for max, the lower is β. (Within the
routine alphabeta, high values for α and
β are good for the player whose move it
is. A high value for α indicates that a
good alternative for that player exists
somewhere in the tree. A low value for
β indicates that a good alternative
exists for the other player somewhere else
in the tree.) When the successor that
captures the queen is evaluated, α
becomes larger than β, and a cutoff
occurs.

α-β pruning serves to reduce the
branching factor, which is the ratio
between the number of nodes searched in
a tree of height N and one of height
N-1, as N tends to ∞. Both theory [3]
and practice [4] agree that with good
move ordering (investigating best moves
first), α-β pruning reduces the
branching factor from the degree of the
lookahead tree nearly to the square root of
that degree. For a given amount of
computing time, this reduction nearly
doubles the depth of the lookahead tree.
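
The accuracy property of alphabeta can be checked empirically. The following Python sketch is an illustration only, not the authors' code. It assumes a lookahead tree encoded as nested lists whose integer leaves are static values seen by the player to move, and the helper names random_tree and obeys_accuracy are hypothetical.

```python
import random

INF = float("inf")


def negamax(node):
    """Exact negamax value; an int is a terminal position."""
    if isinstance(node, int):
        return node
    return max(-negamax(child) for child in node)


def alphabeta(node, alpha, beta):
    """Negamax search restricted to the window (alpha, beta)."""
    if isinstance(node, int):
        return node
    for child in node:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:        # cutoff: remaining successors are skipped
            break
    return alpha


def random_tree(depth, degree, rng):
    """Uniform tree with random integer leaves (test data only)."""
    if depth == 0:
        return rng.randint(-9, 9)
    return [random_tree(depth - 1, degree, rng) for _ in range(degree)]


def obeys_accuracy(tree, alpha, beta):
    """True if the search fails low, fails high, or succeeds as stated."""
    exact = negamax(tree)
    found = alphabeta(tree, alpha, beta)
    if exact <= alpha:
        return found <= alpha            # failing low
    if exact >= beta:
        return found >= beta             # failing high
    return found == exact                # success


rng = random.Random(1)
tree = random_tree(depth=3, degree=3, rng=rng)
print(negamax(tree), alphabeta(tree, -INF, INF))
```

With the full window the two printed values coincide. With a narrower window only the direction of failure is guaranteed, which is all the parallel algorithms below rely on.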
When the algorithm is performed on
a serial computer, the value of one
successor can be used to save work in
evaluating its siblings later on.
Nevertheless, greater speed can be
obtained by conducting α-β search in a
parallel fashion. We define the speedup
of a parallel algorithm over a serial
one to be the time required by the
serial algorithm divided by the time for the
parallel algorithm. We will restrict
our attention to parallel computers
built as a tree of serial computers. A
node in this tree is a processor, a
parent is a master, and a child is a
slave.


3. PARALLEL ASPIRATION SEARCH

In order to introduce parallelism,
Baudet [2] rejects decomposition of the
lookahead tree in favor of a parallel
aspiration search, in which all slave
processors search the entire lookahead
tree, but with different initial α-β
windows. These windows are disjoint,
and in the simplest variant their union
covers the range from -∞ to +∞. Since
each window is considerably smaller than
(-∞,+∞), each processor can conduct
its search more quickly. When the
processor whose window contains the true
minimax value of the tree finishes, it
reports this value, and move selection
is complete. Baudet analyzes several
variants of this algorithm under the
assumption of randomly distributed
terminal values, and concludes that the
obtainable speedup is limited by a
constant independent of the number of
processors available. This maximum is
established to be approximately 5 or 6.
Surprisingly, for k equal to 2 or 3,
Baudet's method yields more than k-way
speedup with k processors. Baudet
infers that the serial α-β search
algorithm is not optimal, and estimates that
a 15 to 25 percent speedup may be gained
by starting the search with a narrow
window.

Since a narrow window does not 
speed up a successful search when moves 
are ordered best-first, Baudet's method 
yields no speedup under best-first move 
ordering. 
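
The idea of parallel aspiration search can be sketched in a few lines of Python. This is a sequential simulation for illustration only, not Baudet's algorithm as published: each loop iteration stands in for one slave processor, trees are nested lists with integer leaves (an assumed encoding), and the names aspiration_windows and aspiration_search are hypothetical. Because values are integers here, adjacent windows are written as (c - 1, c') so that the open intervals together cover every integer exactly once.

```python
INF = float("inf")


def negamax(node):
    """Exact negamax value; an int is a terminal position."""
    if isinstance(node, int):
        return node
    return max(-negamax(child) for child in node)


def alphabeta(node, alpha, beta):
    """Negamax search restricted to the window (alpha, beta)."""
    if isinstance(node, int):
        return node
    for child in node:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:
            break
    return alpha


def aspiration_windows(cuts):
    """Split the integers into windows (lo, hi); a value v belongs to
    the one window with lo < v < hi. cuts must be increasing."""
    lows = [-INF] + [c - 1 for c in cuts]
    highs = list(cuts) + [INF]
    return list(zip(lows, highs))


def aspiration_search(tree, cuts):
    """Run one search per window; only the window that contains the
    true value succeeds, and its result is the negamax value."""
    for lo, hi in aspiration_windows(cuts):   # one "processor" per window
        value = alphabeta(tree, lo, hi)
        if lo < value < hi:
            return value, (lo, hi)
    raise ValueError("windows do not cover the value")


tree = [[3, 5], [-2, 4], [0, 1]]
print(aspiration_search(tree, cuts=[-2, 2]))
```

On a real multiprocessor the searches would run concurrently and the others would be abandoned as soon as one succeeds; the simulation only shows that exactly one window reports success.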


4. THE TREE-SPLITTING ALGORITHM

Another natural way to implement
the α-β algorithm on parallel processors
divides the lookahead tree into its
subtrees at the top level, and queues them
for parallel assignment to a pool of
slave processors. The master processor,
as in the serial algorithm, maintains
the variable α as the maximum of the
negative of all subtree values. Each
slave processor computes the value of
its assigned subtree. The slave may use
either serial α-β search or parallel α-β
search if it has slaves of its own.
When it finishes, it reports the value
computed to its master. As the master
receives responses from slaves, it
narrows its window, and possibly tells
working slaves about the improved
window. When all subtrees have been
evaluated, the master is able to compute
the value of its position. A similar
approach is discussed in [7].
4.1 The Slave Algorithm

The slave algorithm runs at terminal
nodes of the processor tree. We
will describe its interactions with its
master by means of messages. The
algorithm is equally easily expressed in a
shared-memory or call-return form. The
slave receives EVALUATE messages from
its master, followed by any number of
associated UPDATE messages that narrow
its window. When an UPDATE message
arrives, the slave adjusts its recursive
values of α and β to what they would
have been, had the search been started
with the smaller window. When the slave
has performed the search specified by
the EVALUATE command, it sends a VALUE
message back to its master, and then
waits for another EVALUATE message.

The algorithm calls five functions:

   Staticvalue(position)
      returns the static value of
      "position".
   Send(message)
      sends the data in buffer "message"
      to process message.dest.
   Receive(message)
      receives a message sent to this
      process, and places it in
      buffer "message".
   Catch(kind,future message,catcher)
      arranges for all future messages
      with message.kind = "kind" to be
      immediately routed to buffer
      "message", bypassing any receive.
      Catch returns immediately, allowing
      the caller to proceed. Thereafter,
      when a message with the indicated
      kind arrives, the process is
      interrupted, and the routine
      "catcher" is called. When "catcher"
      returns, the process resumes.
      Slaves use catch to receive UPDATE
      messages without wasting time
      polling for them.
   Alphabeta(p)
      was defined in section 2. The
      variables α and β are global
      arrays, not formal parameters,
      in order to facilitate updating
      their values in each recursive
      call of alphabeta when an UPDATE
      message arrives. The global
      variable "depth" represents
      the level of p.


The slave algorithm:

program slave();
label DONE;
var message, updatemessage :
      record
        pos : position;
        α,β,value : integer;
        kind : (EVALUATE,UPDATE,VALUE);
        dest : process;
      end;
    pos : position;
    α,β : array[1..MAXDEPTH] of integer;
    depth : 1..MAXDEPTH;
    tmp : integer;
    succ : array[1..MAXCHILD] of position;
    i,d : 1..MAXCHILD;
    mymaster : process;

procedure catcher;
{ called asynchronously by UPDATE }
var scalα,scalβ,tmp : integer;
    k : 1..MAXDEPTH;
begin
  scalα := updatemessage.α;
  scalβ := updatemessage.β;
  for k := 1 to MAXDEPTH do
  begin { update α-β arrays }
    α[k] := max(α[k],scalα);
    β[k] := min(β[k],scalβ);
    tmp := scalα;
    scalα := -scalβ;
    scalβ := -tmp;
  end
end;

begin
  catch(UPDATE,updatemessage,catcher);
  while true do
  begin { 1 iteration per EVALUATE }
    receive(message); { receive EVALUATE }
    pos := message.pos;
    depth := 1;
    α[depth] := message.α;
    β[depth] := message.β;
    determine the children
      succ[1],...,succ[d] of pos;
    if d = 0 then
      { evaluate terminal position }
      message.value := staticvalue(pos)
    else begin
      for i := 1 to d do
      begin { evaluate each successor }
        α[depth+1] := -β[depth];
        β[depth+1] := -α[depth];
        depth := depth+1;
        tmp := -alphabeta(succ[i]);
        depth := depth-1;
        if tmp > α[depth] then
          α[depth] := tmp;
        if α[depth] ≥ β[depth] then
        begin message.value := α[depth];
          goto DONE; { cutoff occurs }
        end
      end; { for i := 1 to d do }
      message.value := α[depth]; { no cutoff }
    end;
    DONE: message.kind := VALUE;
    message.dest := mymaster;
    send(message);
  end { while TRUE do }
end. { program slave }
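
The bookkeeping done by catcher can be illustrated in isolation. The Python sketch below is a hedged illustration, not the authors' code: the function name apply_update is hypothetical, the per-level windows are held in two lists indexed from the slave's root position, and the sample windows are invented for the example. It shows why the incoming window must be negated and swapped at each deeper level of the negamax recursion.

```python
INF = float("inf")


def apply_update(alpha, beta, new_alpha, new_beta):
    """Narrow every active window in place.

    alpha[k], beta[k] is the window at recursion level k, where index 0
    is the position named in the EVALUATE message.
    """
    a, b = new_alpha, new_beta
    for k in range(len(alpha)):
        alpha[k] = max(alpha[k], a)   # raise the lower bound if tighter
        beta[k] = min(beta[k], b)     # lower the upper bound if tighter
        a, b = -b, -a                 # the child sees the negated window


# Windows of a search that is three levels deep (illustrative values):
alpha = [2, -5, 2]
beta = [INF, -2, 5]
apply_update(alpha, beta, new_alpha=3, new_beta=4)
print(alpha, beta)
```

After the call every level holds the window it would have had if the search had started with (3, 4), so any level whose window has closed will cut off at its next comparison.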
4.2 The Master Algorithm

The master algorithm runs on
non-terminal nodes of the processor tree.
It receives EVALUATE and UPDATE messages
from its master and VALUE messages from
its slave nodes. After an EVALUATE
message is received, the master generates
all successors of the position to be
evaluated. Each slave is requested to
EVALUATE one of these positions; the
remaining positions are queued for
service by slaves. Any UPDATE messages are
relayed to active slaves.

The master may take various actions
when it receives a VALUE message from a
slave. First, if the VALUE message
causes the current α value to increase,
then -α is sent as an updated β value to
all active slaves. Second, if α has
been increased so that it becomes
greater than or equal to β, then an α-β
cutoff occurs. The nonpositive-width
window is sent to all active slaves,
quickly terminating them. Meanwhile,
the master empties its queue of waiting
successor positions. Third, if the
queue of unevaluated successor positions
is non-empty, the reporting slave is
assigned the next position from the queue.

When all successors have been
evaluated, the master sends a VALUE
message to its master. In a game
situation, the algorithm at the root node
might serve as the user interface, and
would remember which move has the
maximum value.
Here is the master algorithm:

program master();
label INIT;
var message :
      record
        pos : position;
        α,β,value : integer;
        kind : (EVALUATE,UPDATE,VALUE);
        dest : process;
      end;
    pos : position;
    succ : array[1..MAXCHILD] of position;
    succstat : array[1..MAXCHILD] of
      (ASSIGNED,UNASSIGNED);
    i,d : 1..MAXCHILD;
    slave : array[1..MAXSLAVE] of process;
    slavestat : array[1..MAXSLAVE] of
      (BUSY,FREE);
    j : 1..MAXSLAVE;
    mymaster : process;
    α,β,tmp : integer;
begin
  while true do
  begin { 1 iteration per EVALUATE }
    INIT: repeat { flush outdated UPDATES }
      receive(message);
    until message.kind = EVALUATE;
    pos := message.pos;
    α := message.α;
    β := message.β;
    determine the successor positions
      succ[1],...,succ[d];
    if d = 0 then
    begin { terminal node }
      message.value := staticvalue(pos);
      message.kind := VALUE;
      message.dest := mymaster;
      send(message);
      goto INIT;
    end;
    for j := 1 to MAXSLAVE do
      slavestat[j] := FREE;
    for i := 1 to d do
      succstat[i] := UNASSIGNED;
    while there exists a FREE slave j
      and an UNASSIGNED successor i do
    begin { give initial assignments }
      message.pos := succ[i];
      message.α := -β;
      message.β := -α;
      message.kind := EVALUATE;
      message.dest := slave[j];
      send(message);
      slavestat[j] := BUSY;
      succstat[i] := ASSIGNED;
    end;
    while there exist BUSY slaves do
    begin
      receive(message);
      if message.kind = UPDATE then
      begin { forward UPDATE message }
        if (message.α > α) or
           (message.β < β) then
        begin
          α := max(α,message.α);
          β := min(β,message.β);
          message.α := -β;
          message.β := -α;
          message.kind := UPDATE;
          send(message) to all slaves;
        end;
        if α ≥ β then { cutoff }
          for i := 1 to d do
            succstat[i] := ASSIGNED;
      end
      else { message.kind = VALUE }
      begin
        j := answering slave;
        slavestat[j] := FREE;
        tmp := -message.value;
        if tmp > α then
        begin { send new α-β window }
          α := tmp;
          message.α := -β;
          message.β := -α;
          message.kind := UPDATE;
          send(message) to all slaves;
        end;
        if α ≥ β then { cutoff }
          for i := 1 to d do
            succstat[i] := ASSIGNED;
        if there remains a successor,
          i, yet to be evaluated then
        begin { reassign slave }
          slavestat[j] := BUSY;
          succstat[i] := ASSIGNED;
          message.pos := succ[i];
          message.α := -β;
          message.β := -α;
          message.kind := EVALUATE;
          message.dest := slave[j];
          send(message);
        end
      end { else message.kind = VALUE }
    end; { while there are BUSY slaves }
    message.value := α;
    message.kind := VALUE;
    message.dest := mymaster;
    send(message);
  end { while TRUE do }
end. { program master }

4.3 Alpha Raising

As an optimization of the master
algorithm, the master running on the
root node may send a special α-β window
to a slave working on the last
unevaluated successor. This window is
(-α-1,-α) instead of the usual (-β,-α).
If that successor is not the best, then
the slave's search will fail high as
usual, but the minimal window speeds its
search. If that successor is best, then
the smaller window causes the search to
fail low, again terminating faster. In
either case, the root master determines
which successor is the best move, even
though its value may not be calculated.
By speeding the search of the last
successor, the idle time of the other
slaves is reduced. (This narrow window
given to the root's last subtree search
can also be used in serial α-β search.)
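
The serial form of this minimal-window trick can be sketched as follows. The Python code is an illustration under stated assumptions, not the authors' implementation: trees are nested lists with integer leaves, the root has at least two successors, and the function name best_move_index is hypothetical. The last successor is searched with the window (-α-1, -α), which is enough to decide whether it beats the best move so far without computing its exact value.

```python
INF = float("inf")


def negamax(node):
    """Exact negamax value; an int is a terminal position."""
    if isinstance(node, int):
        return node
    return max(-negamax(child) for child in node)


def alphabeta(node, alpha, beta):
    """Negamax search restricted to the window (alpha, beta)."""
    if isinstance(node, int):
        return node
    for child in node:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:
            break
    return alpha


def best_move_index(successors):
    """Index of the best root move; assumes two or more successors."""
    alpha, best = -INF, None
    for i, child in enumerate(successors[:-1]):
        value = -alphabeta(child, -INF, -alpha)
        if value > alpha:
            alpha, best = value, i
    # Last successor: minimal window (-alpha-1, -alpha).
    probe = -alphabeta(successors[-1], -alpha - 1, -alpha)
    if probe > alpha:                 # the search failed low: last move wins
        best = len(successors) - 1
    return best


print(best_move_index([[3, 5], [-2, 4], [0, 1]]))
print(best_move_index([[3, 5], [-2, 4], [7, 9]]))
```

In the second call the last move is best, yet the probe only establishes that its value exceeds α; this matches the remark that the best move is found even though its value may not be calculated.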

We can generalize this technique in
the following way, called alpha raising:
Suppose that, among slaves evaluating
successors of the root, slave_1's current
α value, α_1, is lower than any other,
and that slave_2 has the second lowest α
value, say α_2. Update α_1 to α_2 - 1,
speeding up slave_1. If this update
causes slave_1's otherwise successful
search to fail low, then the reported
value is still lower than all others,
and that move is still discovered to be
best.


5. MEASUREMENTS OF THE ALGORITHM

Measurements of the performance of
the tree-splitting algorithm have been
taken on a network of LSI-11
microcomputers running under the Arachne
operating system [5].* The game of checkers
was used to generate lookahead trees.

* We have been forced to change the name
of the Roscoe distributed operating
system, since Roscoe is a registered
trademark of Applied Data Research,
Incorporated. The new name we have chosen is
Arachne; the operating system and
research continue unchanged.

Static evaluation was based on the
difference in a combination of material,
central board position for kings, and
advancement for men. Moves were ordered
best-first according to their static
values. General α-raising was not
employed, except for the special case for
the last successor. A single LSI-11
machine searches lookahead trees at a
rate of about 198 unpruned nodes per
second. Inter-machine messages can be
sent at a rate of about 70 per second.
Since only 5 processors are
currently available in Arachne, it was
not possible to test processor trees of
depth greater than one directly.
Instead, a depth-one processor tree was
used to measure the speedup gained by
replacing a terminal slave processor
with a depth-one processor tree. When
this slave is at level i, we call the
measured speedup Y_i. Y_0 and Y_1 were
measured. The procedure for measuring
Y_1 made one simplifying assumption:
Both a slave processor and a master
processor below level zero can normally
receive UPDATE messages from their
masters. Due to the difficulty of
duplicating the arrival times of these
messages, they were not included in either
the slave or the master-and-slaves case.
(The master still gave its terminal
slaves UPDATE messages.)

Ten board positions, B_1,...,B_10, were
chosen for use in the experiments. These
positions actually arose during a
human-machine game; they span
the entire game. All lookahead trees
from these positions were expanded to a
depth of 8.

Two sets of experiments were performed. The two differed only in that
the first set used one master and two slaves, while the second set
used one master and three slaves. Within each experiment, $Y_0$ was
measured directly for each $B_i$ by evaluating the tree both serially
and with the parallel algorithm running on a depth-one processor
tree. Table 1 summarizes measurements of $Y_0$.

The ten board positions gave rise to 84 successors, so 84 EVALUATE
commands were given to slaves while $Y_0$ was being measured. Times
for both parallel and serial evaluation were measured for each
command. The aggregate speedup for a group of commands is the total
time required to execute them serially divided by the total time
required to execute them in parallel. For each $B_i$, the aggregate
speedup $Y_1$ for its subtree evaluations was computed. Table 2
summarizes measurements of $Y_1$.
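The aggregate speedup defined above is a ratio of totals, not an average of per-command ratios. The following minimal Python sketch illustrates the difference; the function name and the timing values are hypothetical and are not measurements from the experiments.

```python
def aggregate_speedup(serial_times, parallel_times):
    """Total serial time divided by total parallel time for a group of commands."""
    if len(serial_times) != len(parallel_times):
        raise ValueError("one serial and one parallel time per command is required")
    return sum(serial_times) / sum(parallel_times)


# Hypothetical timings (seconds) for three EVALUATE commands; not measured data.
serial = [12.0, 3.0, 5.0]
parallel = [6.0, 3.0, 1.0]

speedup = aggregate_speedup(serial, parallel)
# For contrast: the plain mean of the per-command speedups.
mean_of_ratios = sum(s / p for s, p in zip(serial, parallel)) / len(serial)
```

The ratio of totals weights long evaluations more heavily than short ones, so one short command with a large individual speedup moves the aggregate much less than it moves the mean of ratios.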


Table 1: $Y_0$ for each $B_i$, $i = 1, \ldots, 10$

                        2 slaves     3 slaves
  minimum              [illegible]     1.37
  average                 1.81         2.34
  maximum                 2.36         3.15
  standard deviation      0.31         0.56

Table 2: $Y_1$ for each $B_i$, $i = 1, \ldots, 10$

                        2 slaves     3 slaves
  minimum                 1.03         1.38
  average                 1.46         1.96
  maximum              [illegible]     2.60
  standard deviation      0.22         0.38

Surprisingly, more than k-way speedup was occasionally achieved with
k slaves: Three out of the ten $B_i$ were sped up by more than 2 with
2 slaves, and two of those three were sped up by more than 3 with 3
slaves. Of the 84 subtrees of the $B_i$'s, 4 were sped up by more
than 2 with 2 slaves, and 9 were sped up by more than 3 with 3
slaves; 2 of those achieved 6-way speedup. In each such case, subtree
evaluations finished in a different order than they were assigned.
While one large subtree was being evaluated by one slave, another
smaller subtree was assigned and finished. The large subtree's
evaluation then received an UPDATE message that sped it up or even
terminated it. In fact, time-consuming searches are more likely than
short ones to receive these messages. In particular, the search that
receives the final $(-\alpha-1, -\alpha)$ window is likely to be
larger than average.


6. OPTIMIZATIONS

Since the tree-splitting algorithm can be optimized in several ways,
it should be considered the simplest variant of a family of
tree-decomposing algorithms for α-β search. As a first optimization,
since most of a master's time is spent waiting for messages, that
time could be spent profitably doing subtree searches. However, only
the deepest masters could hope to compete with their slaves in
conducting searches. All other masters are by themselves slower than
their slaves, because their slaves have slaves below them to help.
However, more than half of all masters control terminal slaves, and
greater speedup should be achieved by running a slave algorithm along
with these masters on the same processors. We might expect an
additional 1.5-way speedup from this technique.

A second optimization groups several higher-level masters onto a
single processor. For example, the 3 highest processors in a binary
processor tree could be replaced by 3 processes running on a single
processor.

Third, a master might evaluate a position by assigning that
position's successor's successors to slaves, rather than that
position's successors. Although this technique involves more
message-passing, some advantage might result, because all of a
master's slaves would work on finishing the position's first subtree
before going on to the second. The evaluation of the second subtree
would then receive the full benefit of the beta value generated by
the first subtree. Furthermore, when slaves become idle as one
subtree is finished, they can immediately be set to work on the next
subtree.

Since most game-playing programs must make their move within a
certain time limit, any speedup in tree search ability will generally
be used to search a deeper lookahead tree. If we have an unlimited
supply of processors to form into a binary tree, we can obtain an
unlimited speedup only if the search is not limited in time.
Otherwise we cannot, because we would eventually violate our premise
that the lookahead tree is at least as deep as the processor tree. A
new layer on the processor tree does not buy another full ply in the
lookahead tree. For example, several speedups of 1.5 would be needed
to search a 6-times larger chess lookahead tree, or about one
additional ply. The depth of the processor tree would grow faster
than the depth of the tree it searches and eventually would catch up.
The only way to avoid this limit is to increase the fan-out of the
processor tree. If the fan-out is high enough that no successor need
ever be queued for evaluation by a slave, then the size of the
maximum lookahead tree that can be evaluated within the time limit is
limited only by the time required for EVALUATE commands to propagate
from the root to the leaves. Long before this limitation is reached,
we would run out of silicon for making the processors.


7. ANALYSIS OF SPEEDUP

We will now analyze the speedup that can be gained in searching large
lookahead trees as the number of available processors grows without
bound. For this purpose we introduce Palphabeta, a simplified version
of the tree-splitting algorithm. This algorithm is less efficient
than the version already discussed, but is more amenable to analysis.
Much of the analysis in this section is a "parallelization" of
results in [3]. Indeed, when $q = 0$ and $f = 1$, Theorem 1 and
Corollary 1 reduce to results given by [3].

As before, the processors will be arranged in a uniform tree. Let
$f > 1$ be the fan-out of the processor tree (uniform for all
non-terminal nodes), and let $q > 1$ be its depth (uniform for all
terminal nodes). Let $q + s$ be the depth of the lookahead tree,
where $s \ge 1$. We assume that the lookahead tree has a uniform
degree and that this degree, $df$, is a multiple of $f$, where $d$ is
$\ge 2$.


The $f$ function calls specified in the first line of the for-loop
are intended to occur in parallel, activating functions existing on
each of the $f$ slaves. Unlike the tree-splitting algorithm,
Palphabeta waits until all slaves finish before assigning additional
tasks. Serial α-β search is activated on leaf slaves; Palphabeta is
activated on all others. Here is the simpler parallel α-β algorithm.


function Palphabeta(p : position;
                    α, β : integer) : integer;
var i : integer;
    function g : integer;
begin
    determine the successors p_1, ..., p_df;
    if depth(p_1) < q then
        g := Palphabeta
    else
        g := alphabeta;
    for i := 1 to d do
    begin
        α := max(α, max{ -g(p_j, -β, -α) : (i-1)f < j ≤ if });
        if α ≥ β then go to DONE;
    end;
DONE:
    Palphabeta := α;
end.
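The batch structure of Palphabeta can be sketched in Python as a sequential simulation. This is an illustration under stated assumptions, not the authors' implementation: the f slaves of a batch are simulated one after another with the same window, game trees are nested lists whose integer leaves are scored for the side to move, and the names `negamax`, `alphabeta`, `palphabeta`, and `random_tree` are chosen here for illustration.

```python
import math
import random


def negamax(p):
    """Exhaustive negamax value of a nested-list tree with integer leaves."""
    if isinstance(p, int):
        return p
    return max(-negamax(c) for c in p)


def alphabeta(p, alpha, beta):
    """Serial alpha-beta search in negamax form."""
    if isinstance(p, int):
        return p
    for child in p:
        alpha = max(alpha, -alphabeta(child, -beta, -alpha))
        if alpha >= beta:
            break
    return alpha


def palphabeta(p, alpha, beta, depth, q, f):
    """Batch-synchronous search: the f successors of a batch share one window."""
    if isinstance(p, int):
        return p
    for start in range(0, len(p), f):
        batch = p[start:start + f]
        if depth + 1 < q:   # successor is handled by an interior processor
            vals = [-palphabeta(c, -beta, -alpha, depth + 1, q, f) for c in batch]
        else:               # successor is handled by a leaf slave
            vals = [-alphabeta(c, -beta, -alpha) for c in batch]
        # alpha is raised only after every slave in the batch has reported.
        alpha = max(alpha, max(vals))
        if alpha >= beta:
            break
    return alpha


def random_tree(rng, depth, degree):
    """Uniform tree of the given depth and degree with random integer leaves."""
    if depth == 0:
        return rng.randint(-50, 50)
    return [random_tree(rng, depth - 1, degree) for _ in range(degree)]


rng = random.Random(0)
tree = random_tree(rng, depth=4, degree=4)   # degree df = 4 with d = 2, f = 2
value = palphabeta(tree, -math.inf, math.inf, depth=0, q=2, f=2)
```

Because the window is frozen within a batch, later members of a batch cannot benefit from earlier ones; this is the inefficiency the text attributes to Palphabeta relative to the tree-splitting algorithm.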


7.1 Worst-first ordering

α-β search produces no cutoffs if, whenever the call
alphabeta(p, α, β) is made, the following relation holds among the
successors $p_1, \ldots, p_d$:

$$\alpha < -\mathrm{negamax}(p_1) < \cdots < -\mathrm{negamax}(p_d) < \beta.$$

We call this ordering worst first. If no cutoffs occur, it is easy to
calculate the time necessary for Palphabeta to finish. Assume that a
processor can generate $f$ successors, send messages to all of its
$f$ slaves and receive replies in time $\rho$. (This figure counts
message overhead time but does not include computation time at the
slaves.) Assume also that the serial α-β algorithm takes time $n$ to
search a lookahead tree with $n$ terminal positions. Let $a_n$ be the
time necessary for a processor at distance $n$ from the leaves to
evaluate its assigned position. A leaf processor executes the serial
algorithm to depth $s$. Thus we have $a_0 = (df)^s$. An interior
processor gives $d$ batches of assignments to its slaves, and each
batch takes time $\rho$ plus the time for the slave processor to
complete its calculation. Thus we have $a_{n+1} = d(\rho + a_n)$. The
solution to this recurrence relation is

$$a_q = d^q (df)^s + \rho\, d\, \frac{d^q - 1}{d - 1},$$

which is the total time for Palphabeta to complete. Since the time
for the serial algorithm to examine the same tree is $(df)^{q+s}$,
the speedup for large $s$ is $f^q$. There are $(f^{q+1}-1)/(f-1)$
processors, roughly $f^q$, so when no pruning occurs, the parallel
algorithm yields speedup that is roughly equal to the number of
processors used.
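The worst-first recurrence can be checked numerically. The sketch below iterates $a_0 = (df)^s$, $a_{n+1} = d(\rho + a_n)$, compares the result with the closed form $d^q (df)^s + \rho d (d^q - 1)/(d - 1)$ obtained by unrolling the recurrence, and shows the speedup approaching $f^q$ as $s$ grows. The parameter values are illustrative assumptions only.

```python
from fractions import Fraction


def worst_first_time(q, s, d, f, rho):
    """Iterate a_0 = (d f)^s, a_{n+1} = d (rho + a_n) up to n = q."""
    a = Fraction((d * f) ** s)
    for _ in range(q):
        a = d * (rho + a)
    return a


def closed_form(q, s, d, f, rho):
    """d^q (d f)^s + rho d (d^q - 1)/(d - 1); requires d > 1."""
    return d ** q * (d * f) ** s + Fraction(rho * d * (d ** q - 1), d - 1)


def speedup(q, s, d, f, rho):
    """Serial time (d f)^(q+s) divided by the parallel worst-first time."""
    return Fraction((d * f) ** (q + s)) / worst_first_time(q, s, d, f, rho)


q, d, f, rho = 2, 3, 2, 5          # illustrative values only
ratios = [float(speedup(q, s, d, f, rho)) for s in (1, 4, 8, 12)]
limit = f ** q
```

The message overhead term involving $\rho$ does not grow with $s$, which is why it vanishes from the speedup for deep lookahead trees.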


7.2 Best-first ordering

We will now investigate what happens when the lookahead tree is
ordered best-first. We omit the proofs of Theorems 1 and 2 in the
interests of conciseness. Full details may be found in [6].

Definition: We will use the Dewey decimal system to name nodes in
both processor trees and lookahead trees. The root is named by the
null string. The $j$ successors of a node whose name is
$a_1 \cdots a_k$ are named by $a_1 \cdots a_k 1$ through
$a_1 \cdots a_k j$.

Definition: We say that the successors of a position
$a_1 \cdots a_k$ are in best-first order if

$$\mathrm{negamax}(a_1 \cdots a_k) = -\mathrm{negamax}(a_1 \cdots a_k 1).$$

Definition: We say a position $a_1 \cdots a_k$ in the lookahead tree
is $(q,f)$-critical if $a_i$ is $(q,f)$-restricted for all even
values of $i$ or for all odd values of $i$. An entry $a_i$ is
$(q,f)$-restricted if

    $i \le q$ and $1 \le a_i \le f$,
    or if $q < i$ and $a_i = 1$.

Theorem 1: Consider a lookahead tree for which the value of the root
position is not $\pm\infty$ and for which the successors of every
position are in best-first order. The parallel α-β procedure
Palphabeta examines exactly the $(q,f)$-critical positions of this
lookahead tree.


Corollary 1: If every position on levels $0, 1, \ldots, q+s-1$ of a
lookahead tree of depth $q+s$ satisfying the conditions of Theorem 1
has exactly $df$ successors, for $d$ some fixed constant, and for $f$
the constant appearing in Palphabeta, then the parallel procedure
Palphabeta (along with alphabeta, which it calls), running on a
processor tree of fan-out $f$ and height $q$, examines exactly

$$f^{\lfloor q/2 \rfloor} (df)^{\lceil (q+s)/2 \rceil} + f^{\lceil q/2 \rceil} (df)^{\lfloor (q+s)/2 \rfloor} - f^q$$

terminal positions.

Proof: There are $f^{\lfloor q/2 \rfloor} (df)^{\lceil (q+s)/2 \rceil}$
sequences $a_1 \cdots a_{q+s}$, with $1 \le a_i \le df$ for all $i$,
such that $a_i$ is $(q,f)$-restricted for all even values of $i$;
there are $f^{\lceil q/2 \rceil} (df)^{\lfloor (q+s)/2 \rfloor}$ such
sequences with $a_i$ $(q,f)$-restricted for all odd values of $i$;
and we subtract $f^q$ for the sequences that we counted twice.

Q.E.D.
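The count in Corollary 1 can be checked by brute force on small trees. The sketch below enumerates every Dewey name of length $q+s$, tests the $(q,f)$-critical condition directly from the definition, and compares the total with the closed form $f^{\lfloor q/2 \rfloor}(df)^{\lceil (q+s)/2 \rceil} + f^{\lceil q/2 \rceil}(df)^{\lfloor (q+s)/2 \rfloor} - f^q$. The function names are chosen here for illustration.

```python
from itertools import product


def restricted(i, a_i, q, f):
    """(q,f)-restricted: i <= q and 1 <= a_i <= f, or q < i and a_i == 1."""
    return (1 <= a_i <= f) if i <= q else (a_i == 1)


def is_critical(name, q, f):
    """True if all even-indexed or all odd-indexed entries are restricted."""
    evens = all(restricted(i, a, q, f) for i, a in enumerate(name, 1) if i % 2 == 0)
    odds = all(restricted(i, a, q, f) for i, a in enumerate(name, 1) if i % 2 == 1)
    return evens or odds


def count_by_enumeration(q, s, d, f):
    """Count critical terminal positions by listing every name of length q+s."""
    degree = d * f
    return sum(is_critical(name, q, f)
               for name in product(range(1, degree + 1), repeat=q + s))


def count_by_formula(q, s, d, f):
    """Closed form: floor(x/2) is x // 2 and ceil(x/2) is (x + 1) // 2."""
    n, df = q + s, d * f
    return (f ** (q // 2) * df ** ((n + 1) // 2)
            + f ** ((q + 1) // 2) * df ** (n // 2)
            - f ** q)


example = (count_by_enumeration(2, 2, 2, 2), count_by_formula(2, 2, 2, 2))
```

With $q = 0$ and $f = 1$ the formula collapses to $d^{\lceil s/2 \rceil} + d^{\lfloor s/2 \rfloor} - 1$, the serial count of [3].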
Theorem 2: Under the conditions of Corollary 1, and assuming also
that (1) serial α-β search is performed in time equal to the number
of leaves visited, and (2) in $\rho$ units of time, a processor can
generate $f$ successors of a position, send a message to each of its
$f$ slaves, and receive the $f$ replies, then the total time for
Palphabeta to complete is

$$(df)^{\lfloor s/2 \rfloor} + (df)^{\lceil s/2 \rceil} - 1 + h(q)\bigl[d\bigl(3\rho + (df)^{\lfloor s/2 \rfloor} + (df)^{\lceil s/2 \rceil}\bigr) + \rho - (df)^{\lfloor s/2 \rfloor} - (df)^{\lceil s/2 \rceil}\bigr] - \rho q,$$

if $q$ is even;

$$(df)^{\lfloor s/2 \rfloor} + (df)^{\lceil s/2 \rceil} - 1 + h(q-1)\bigl[d\bigl(3\rho + (df)^{\lfloor s/2 \rfloor} + (df)^{\lceil s/2 \rceil}\bigr) + \rho - (df)^{\lfloor s/2 \rfloor} - (df)^{\lceil s/2 \rceil}\bigr] - \rho q + d^{(q-1)/2}\bigl[d\bigl(\rho + (df)^{\lfloor s/2 \rfloor}\bigr) + \rho - (df)^{\lfloor s/2 \rfloor}\bigr],$$

if $q$ is odd; where the function $h$ is defined by

$$h(q) = (d^{q/2} - 1)/(d - 1).$$

Under conditions of best-first search, the parallel α-β algorithm
gives $O(\sqrt{k})$ speedup with $k$ processors for searching large
lookahead trees.


Theorem 3 formalizes this result:

Theorem 3: Suppose that Palphabeta runs on a processor tree of depth
$q > 1$ and fan-out $f > 1$. Suppose that the lookahead tree to be
searched is arranged in best-first order and is of degree $df$ and
depth $q+s$, where $d > 1$. Denote by $R$ the time for alphabeta to
search this tree, and by $P$ the time for Palphabeta to search the
tree. Then

$$\lim_{s \to \infty} R/P = f^{q/2}.$$

Proof: The time for the serial algorithm is

$$(df)^{\lfloor (s+q)/2 \rfloor} + (df)^{\lceil (s+q)/2 \rceil} - 1$$

from Corollary 1. If we divide this quantity by the expression given
by Theorem 2 for $P$, and take the limit as $s$ goes to $\infty$, we
obtain the desired result.

Q.E.D.
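The limit in Theorem 3 can be illustrated numerically for an even $q$. The sketch below assumes that the even-$q$ expression of Theorem 2 reads $L + U - 1 + h(q)[d(3\rho + L + U) + \rho - L - U] - \rho q$, with $L = (df)^{\lfloor s/2 \rfloor}$, $U = (df)^{\lceil s/2 \rceil}$ and $h(q) = (d^{q/2} - 1)/(d - 1)$; this reading of the printed formula, the function names, and the parameter values are all assumptions made for illustration.

```python
from fractions import Fraction


def serial_time(q, s, d, f):
    """Leaves visited by serial alpha-beta on a best-first tree of depth q+s."""
    n, df = q + s, d * f
    return df ** (n // 2) + df ** ((n + 1) // 2) - 1


def parallel_time_even_q(q, s, d, f, rho):
    """Assumed even-q expression for the Palphabeta completion time."""
    if q % 2 != 0 or d <= 1:
        raise ValueError("this sketch covers even q and d > 1 only")
    df = d * f
    low, high = df ** (s // 2), df ** ((s + 1) // 2)
    h = Fraction(d ** (q // 2) - 1, d - 1)
    return (low + high - 1
            + h * (d * (3 * rho + low + high) + rho - low - high)
            - rho * q)


q, d, f, rho = 4, 2, 3, 10         # illustrative values only
ratios = [float(Fraction(serial_time(q, s, d, f))
                / parallel_time_even_q(q, s, d, f, rho))
          for s in (2, 10, 40)]
limit = f ** (q / 2)               # f^(q/2) = 9 for these values
```

For shallow trees the overhead terms in $\rho$ dominate and the ratio is far below $f^{q/2}$; the bound is approached only as $s$ grows.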

7.3 Discussion

The measurements presented in section 5 fall within the range bounded
by the theoretically-predicted best-first and worst-first speedups.
If we take $Y_0 Y_1$ to be the speedup that would be given by a
processor tree of depth two, then the measured speedup for two,
three, four, and nine terminal processors is 1.81, 2.34, 2.64, and
4.59 respectively. Theory predicts speedup equal to the number of
terminal processors for worst-first ordering. Best-first speedup is
predicted to be the square root of the number of terminal processors,
or 1.41, 1.73, 2, and 3 respectively.

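The comparison in the discussion is simple arithmetic on the average speedups reported in Tables 1 and 2. The short Python sketch below reproduces it; the variable names are chosen here for illustration.

```python
import math

# Average speedups from Tables 1 and 2, keyed by the number of slaves.
y0 = {2: 1.81, 3: 2.34}
y1 = {2: 1.46, 3: 1.96}

# Depth-one trees give Y0; depth-two trees are estimated by the product Y0 * Y1.
measured = {
    2: y0[2],
    3: y0[3],
    4: y0[2] * y1[2],
    9: y0[3] * y1[3],
}

# Rows of (terminal processors, best-first bound, measured, worst-first bound).
rows = [(k, round(math.sqrt(k), 2), round(measured[k], 2), k)
        for k in sorted(measured)]
```

Each measured value lies between the square-root (best-first) prediction and the linear (worst-first) prediction for the same number of terminal processors.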


8. ACKNOWLEDGMENTS

The authors gratefully acknowledge the help and ideas offered by Karl
Anderson, Will Leland, Marvin Solomon, and Larry Travis.


9. REFERENCES

[1] Berliner, H.J., "A Chronology of Computer Chess and its
    Literature," Artificial Intelligence, Vol. 10, 1978, (April,
    1978), pp. 201-214.

[2] Baudet, G.M., The Design and Analysis of Algorithms for
    Asynchronous Multiprocessors, Department of Computer Science,
    Carnegie-Mellon University Technical Report, (April, 1978),
    182 pp.

[3] Knuth, D.E., and Moore, R.W., "An Analysis of Alpha-Beta
    Pruning," Artificial Intelligence, Vol. 6, No. 4, (Winter, 1975),
    pp. 293-326.

[4] Samuel, A.L., "Some Studies in Machine Learning Using the Game of
    Checkers, II - Recent Progress," IBM Journal of Research and
    Development, (November, 1967), pp. 601-617.

[5] Solomon, M. H., Finkel, R.A., "The Roscoe Distributed Operating
    System," Seventh ACM Symposium on Operating Systems Principles,
    (Dec. 1979).

[6] Fishburn, J.P., Finkel, R.A., and Lawless, S.A., Two Papers on
    Alpha-Beta Pruning, (Revised) Department of Computer Science,
    University of Wisconsin-Madison Technical Report, (June 1980),
    33 pp.

[7] Akl, S.G., Barnard, D.T., and Doran, R.J., Searching Game Trees
    in Parallel, Department of Computing and Information Science,
    Queen's University Technical Report, (Nov. 1979), 36 pp.


TWO PARALLEL ALGORITHMS FOR SHORTEST 
PATH PROBLEMS 


Narsingh Deo 
C. Y. Pang 
R. E. Lord 


Computer Science Department 
Washington State University 
Pullman, Washington 99164 


ABSTRACT 


After examining several dozen serial algorithms and their variations
for various shortest-path problems, two algorithms were selected as
good candidates for parallelization on an MIMD-type processor. These
are: (1) the Pape-D'Esopo version of Moore's algorithm for finding
shortest paths from one node to all others, and (2) the
Warshall-Floyd algorithm for finding shortest paths between all pairs
of nodes. The techniques used in designing the two parallel
algorithms are fundamentally different: one involves parallel
processing with a queue and is suited for sparse networks, while the
other employs matrix methods and is suited for dense networks. The
correctness of these algorithms is proved. Execution times are
analyzed and compared with actual execution times on the HEP computer
(an MIMD machine).


1. INTRODUCTION 


Shortest-path problems are by far the most fundamental and also the
most commonly encountered problems in the study of transportation and
communication networks. Often the repeated determination of shortest
paths and distances forms the core (inner loop) of many
transportation planning and utilization packages. Therefore, the
search for faster and faster shortest-path procedures continues.
After reviewing over 200 papers on shortest-path algorithms and after
classifying and analyzing several dozen existing algorithms [5], two
points became evident to us (among other things): (1) the
shortest-path problems have almost reached their theoretical bounds
of speed if conventional serial computers are to be used; and (2)
certain algorithms (which may be most suited for serial mode) cannot
be "parallelized" as readily as others. For example, Dijkstra's
algorithm [4, 7, 18] for finding a shortest path between two nodes is
not as well suited for parallelization as the Bellman-Moore
[5, 14, 21] algorithm is.
We have selected two algorithms (for solving two different
shortest-path problems), which appear to us as the best candidates
for parallelization, for a detailed presentation in this paper. These
are: (1) the Pape-D'Esopo version of Moore's algorithm for finding
shortest paths from one node to all others [14, 15] and (2) the
Warshall-Floyd [4, 10, 18] algorithm for finding shortest paths
between all pairs of nodes. The techniques used in designing the two
parallel algorithms are fundamentally different: one involves
parallel processing with a queue and is suited for sparse networks,
while the other employs matrix methods and is suited for dense
networks.

This work was supported by U.S. Department of Transportation contract
no. DOT-RC-92042 and by NSF grant no. MCS78-25851.

We designed parallel versions of these two 
algorithms, suited for an MIMD (multiple instruction 
multiple data stream) [11] machine--keeping an eye, 
in fact, on the characteristics of the specific MIMD 
machine on which the designed parallel programs 
were actually to be executed. For example, on this 
machine the time required in creating a process is 
greater than the time needed to lock or unlock a 
resource. 

In recent years, MIMD machines are not only being built
experimentally in university laboratories, but they are also being
built in private industry. The Heterogeneous Element Processor (HEP)
of DENELCOR Inc. [20], and the SMS 201 of Siemens AG [12] are two
examples of commercial MIMD machines. Since the HEP was available to
us, we coded and executed our programs on the HEP and performed the
timing study on it.

Although a number of theoretical studies have 
been reported on parallel processing of graphs [1, 
8, 9, 13, 17, 19], very few of them have considered 
the specific problems of shortest path problems and 
none have actually designed, coded and executed a 
parallel shortest-path algorithm on a real parallel 
computer (particularly on an MIMD computer) to the 
best of our knowledge. This study considers many 
of the real nuts-and-bolts issues of parallelization of 
existing algorithms, data structures, efficiencies 
and speed-gains over the serial implementations. 


In Section 2, we will give definitions relevant to shortest paths on
a network. In Section 3, we design a parallel algorithm for finding
shortest paths from one specified node to all other nodes in a given
network. The proof of correctness of the algorithm and the details of
our model of computation are also given in Section 3. In Section 4,
we present the second algorithm, for finding shortest paths between
all pairs of nodes in a given network. The proof of its correctness
and some empirical results on execution time are also presented in
Section 4.


2. SOME DEFINITIONS 


The following are the definitions of some of the important
graph-theoretic terms used in this paper. Definitions for the rest of
the terms can be found in any textbook on graph algorithms or
networks [4, 18]. A directed graph G = (V, E) is an ordered pair of
finite sets: V of nodes, and E of arcs. We will use NODES to denote
the number of nodes in V. We will also use {1, 2, ..., NODES} to
denote the elements of V. An arc a in E is an ordered pair, (u, v),
of nodes. An arc a = (u, v) is said to start at u and end at v. A
network is a directed graph, G, together with a real valued function,
ℓ, on the set of arcs. For any arc a, ℓ(a) is the arc length of a. An
arc length matrix has its (u, v)th entry as ℓ(u, v) if the arc (u, v)
exists. The entry is ∞ if (u, v) does not exist. A path P is a finite
sequence of arcs $P = (a_1, a_2, \ldots, a_k)$, such that $a_i$
starts where $a_{i-1}$ ends, for $i = 2, \ldots, k$. The length d(P)
of a path P is defined to be
$d(P) = \ell(a_1) + \cdots + \ell(a_k)$. If $a_i = (u_{i-1}, u_i)$,
we will, in addition, use $(u_0, u_1, \ldots, u_k)$ to denote P, and
P is called a path from $u_0$ to $u_k$. A path that starts and ends
at the same node is called a cycle. A cycle with negative path length
is called a negative cycle. P is a shortest path from u to v if d(P)
is minimum over the length of all paths from u to v; the shortest
distance from u to v is then d(P). The one-to-all shortest path
problem is the problem of finding the shortest paths from a given
node, called the source, to all the other nodes, the destinations.
The all-to-all shortest path problem is the problem of finding a
shortest path for every pair of nodes in the network.


3. A PARALLEL ALGORITHM FOR THE ONE-TO- 
ALL SHORTEST-PATH PROBLEM 


A modification of Moore's algorithm [14] by 
D'Esopo as reported in [16] was further developed 
by Pape [15] into two very efficient codes for 
finding shortest paths from a specified source node 
to all other nodes in the given network. This Pape- 
D'Esopo-Moore algorithm, which we will refer to as 
PDM algorithm, may be described in an Algol-like 
language as follows:


Algorithm PDM
 1  for all u ≠ SOURCE do
 2      D[u] := ∞;
 3  D[SOURCE] := 0;
 4  initialize Q to contain SOURCE only;
 5  while Q is not empty do
 6  begin
 7      delete Q's head node u;
 8      for each arc (u, v) that starts at u do
 9          if D[v] > D[u] + ℓ(u, v) then
10          begin
11              P[v] := u;
12              D[v] := D[u] + ℓ(u, v);
13              if v was never in Q then
14                  insert v at the tail of Q;
15              if v was in Q, but not currently then
16                  insert v at the head of Q
17          end
18  end

During the execution of Algorithm PDM, the label D[u] is always
updated to be the currently known shortest distance from SOURCE to u,
and P[u] is always updated to be the predecessor node of u on the
currently known shortest path from SOURCE to u. Since each insertion
of a node u into Q is preceded by a decrement of D[u], this algorithm
is guaranteed to terminate provided the input network has no negative
cycles.

To see that the D[u]'s do indeed converge to the shortest distances,
we first note that at termination D[v] ≤ D[u] + ℓ(u, v) holds for
every arc (u, v). Suppose the node sequence
$(\mathrm{SOURCE} = u_0, u_1, \ldots, u_k = u)$ is a path from SOURCE
to u, then its path length is given by

$$\ell(u_0, u_1) + \cdots + \ell(u_{k-1}, u_k) \ge (-D[u_0] + D[u_1]) + \cdots + (-D[u_{k-1}] + D[u_k]) = -D[\mathrm{SOURCE}] + D[u] = D[u].$$

Thus, D[u] is the shortest distance from SOURCE to u, and the node
sequence,

    SOURCE = P[ ... P[u] ... ], ..., P[P[u]], P[u], u

is the shortest path from SOURCE to u, as obtained by Algorithm PDM.
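Algorithm PDM translates almost line for line into Python. The sketch below is one possible rendering, assuming the network is given as a dictionary `arcs` mapping each node to a list of `(successor, length)` pairs and that it contains no negative cycles; the names `pdm_shortest_paths` and `path_to` and the small example network are illustrative and do not come from the paper.

```python
import math
from collections import deque

NEVER, IN_Q, WAS_IN_Q = 0, 1, 2


def pdm_shortest_paths(nodes, arcs, source):
    """Label-correcting search with the two-ended queue rule of Algorithm PDM."""
    dist = {u: math.inf for u in nodes}      # D[u]
    pred = {u: None for u in nodes}          # P[u]
    state = {u: NEVER for u in nodes}        # queue history of each node
    dist[source] = 0
    queue = deque([source])
    state[source] = IN_Q
    while queue:
        u = queue.popleft()                  # delete Q's head node
        state[u] = WAS_IN_Q
        for v, length in arcs.get(u, ()):
            if dist[v] > dist[u] + length:
                pred[v] = u
                dist[v] = dist[u] + length
                if state[v] == NEVER:
                    queue.append(v)          # first visit: tail of Q
                    state[v] = IN_Q
                elif state[v] == WAS_IN_Q:
                    queue.appendleft(v)      # revisit: head of Q
                    state[v] = IN_Q
    return dist, pred


def path_to(pred, source, u):
    """Follow predecessor labels back from a reachable node u to source."""
    path = [u]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return path[::-1]


# Small illustrative network (not from the paper).
arcs = {1: [(2, 4), (3, 1)], 2: [(4, 5)], 3: [(2, 2), (4, 8)], 4: [(5, 3)]}
dist, pred = pdm_shortest_paths(range(1, 6), arcs, source=1)
route = path_to(pred, 1, 5)
```

In this example node 2 is first labelled 4, then relabelled 3 after it has already left the queue, so it re-enters at the head; that is the case handled by statements 15 and 16.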

The experiments of Denardo and Fox [2], Dial, 
Glover, Karney and Klingman [3], Pape [8], and 
Vliet [11] show that on the average Algorithm PDM 
is faster than almost every other shortest-path
algorithm, if the input network has a low arcs to
nodes ratio. We will, therefore, base our parallel 
algorithm on Algorithm PDM. 


Let us fix our model of parallel computation 
before developing parallel algorithms. We will 
assume that our computer can simultaneously 
execute up to K processes. The communication 
between the processes is done via a common memory. 
The computer supports the operations: create, lock, and unlock
[pp. 77-78 of Ref. 2]. When a process $P_1$ executes the statement
"create process $P_2$," $P_2$ will start execution and $P_1$ will
continue. For a memory location X, after process $P_1$ executes
"lock X," any other process that attempts to read, write, or lock X
will have to wait until $P_1$ executes an "unlock X." Our model of
computation is a realistic
one; for the HEP computer can simultaneously 
execute processes, it has a common memory for all 
the processes, and it supports the operations 
create, lock, and unlock efficiently. 


For practical reasons, we will assume that 
create, lock, and unlock take non-zero units of time 
to execute. In designing our algorithm, we also 
assume that create requires a longer execution time 
than lock and unlock. This assumption is also 
realistic, because create in the HEP machine using 
the FORTRAN language is implemented with four 
instructions, whereas only one machine instruction 
is required for implementing lock or unlock. 


An obvious way to utilize the concurrent processing in Algorithm PDM
would be to execute the inner for loop (statements 8 to 17)
simultaneously. But this approach is unprofitable because the
overhead for a create is high compared to the execution of one pass
of the loop. Moreover, in this approach the maximum number of
concurrent processes utilized would be about four, if the input is a
typical road network (with outdegree ≈ 4). Therefore, we will avoid
breaking the inner for loop into different processes; instead we will
distribute the passes of the while loop (statements 5 to 18) to
different processes. This will avoid excessive use of create's.


We will use only K-1 create's to obtain a total of K concurrent
processes at the beginning of the algorithm, and use lock's and
unlock's to take care of the rest of the synchronization. During the
execution of the algorithm, the K processes (one called MASTER and
the others called WORKERs) share the computation load, as long as
there are known tasks to be performed. Each process takes
approximately 1/K of the work load in the initialization step. In the
path-finding step, each process repeatedly deletes a node, u, from Q,
and updates P[v]'s and D[v]'s for the successors, v's, of u. In
addition to a WORKER's tasks, the MASTER is responsible for finishing
the initialization step and for synchronizing the initiation and
termination of the path-finding step. Our parallel algorithm, which
we will refer to as PPDM, is as follows:
Algorithm PPDM (Parallel Pape-D'Esopo-Moore)

Process MASTER
 1      MSYN := "yes"; WAIT := 0; DONE := 0;
 2      for i := 2 step 1 until K do
 3          create process WORKER(i);
 4      for u := 1 step K until NODES do
 5          D[u] := ∞;
 6  L1: if WAIT < K - 1 then goto L1;
 7      D[SOURCE] := 0;
 8      initialize Q to contain SOURCE only;
 9  L2: lock Q;
10      if Q is empty then goto L3;
11      delete Q's head node u;
12      unlock Q;
13      MSYN := "no";
14      reach successor nodes of u (Block B);
15      MSYN := "yes";
16      goto L2;
17  L3: if WAIT = K - 1 then goto L4;
18      unlock Q;
19      goto L2;
20  L4: DONE := 1;
21      unlock Q;
22  L5: if DONE < K then goto L5

Process WORKER(i)
 1      for u := i step K until NODES do
 2          D[u] := ∞;
 3  L1: if MSYN = "yes" then goto L3;
 4      lock Q;
 5      if Q is empty then goto L2;
 6      delete Q's head node u;
 7      unlock Q;
 8      reach successor nodes of u (Block B);
 9      goto L1;
10  L2: unlock Q;
11      goto L1;
12  L3: lock WAIT; WAIT := WAIT + 1; unlock WAIT;
13  L4: if DONE > 0 then goto L5;
14      if MSYN = "yes" then goto L4;
15      lock WAIT; WAIT := WAIT - 1; unlock WAIT;
16      goto L1;
17  L5: lock DONE; DONE := DONE + 1; unlock DONE

Block B
 1  for each arc (u, v) that starts at u do
 2  begin
 3      newdv := D[u] + ℓ(u, v);
 4      lock D[v];
 5      if D[v] ≤ newdv then
 6          unlock D[v]
 7      else begin
 8          P[v] := u;
 9          D[v] := newdv;
10          unlock D[v];
11          lock Q;
12          if v was never in Q then
13              insert v at the tail of Q;
14          if v was in Q, but is not currently then
15              insert v at the head of Q;
16          unlock Q
17      end
18  end
Note: For Block B of the MASTER process, statement 11 should be
changed to:
11      MSYN := "yes"; lock Q; MSYN := "no";

In Algorithm PPDM, the local variables are written in lower case
letters; they are i, u, v, and newdv. The variables MSYN, WAIT, and
DONE are the communication links between the MASTER and the WORKERs.
MSYN = "yes" signals the WORKERs to let the MASTER check the Q first.
WAIT is the number of WORKERs waiting for further command from the
MASTER (i.e. WAIT is the number of WORKER processes which are
executing statements 13 and 14). DONE is used by the MASTER to
broadcast the termination signal. This algorithm requires the
processes to keep on processing Block B until Q is empty. Block B is
equivalent to statements 8 to 17 of Algorithm PDM. The locking and
unlocking of D[v] and Q are added in Block B to ensure that Algorithm
PPDM computes correctly.
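The locking discipline of Block B can be sketched with Python threads. This is a simplified illustration, not Algorithm PPDM itself: the MSYN/WAIT/DONE handshake is replaced by a single `busy` counter guarded by the queue lock (an assumption made here to keep the sketch short), `threading.Lock` objects stand in for the lock and unlock operations, and the function name, input format, and example network are hypothetical.

```python
import math
import threading
import time
from collections import deque

NEVER, IN_Q, WAS_IN_Q = 0, 1, 2


def parallel_pdm(nodes, arcs, source, k=4):
    """Label-correcting search shared by k threads; arcs maps u -> [(v, length)]."""
    dist = {u: math.inf for u in nodes}
    pred = {u: None for u in nodes}
    state = {u: NEVER for u in nodes}
    label_lock = {u: threading.Lock() for u in nodes}  # stands in for "lock D[v]"
    q_lock = threading.Lock()                          # stands in for "lock Q"
    queue = deque([source])
    dist[source], state[source] = 0, IN_Q
    busy = [0]  # threads currently fanning out from a node; guarded by q_lock

    def block_b(u):
        for v, length in arcs.get(u, ()):
            newdv = dist[u] + length
            with label_lock[v]:
                if dist[v] <= newdv:
                    continue            # leaving the with-block releases the lock
                pred[v] = u
                dist[v] = newdv
            with q_lock:
                if state[v] == NEVER:
                    queue.append(v)
                    state[v] = IN_Q
                elif state[v] == WAS_IN_Q:
                    queue.appendleft(v)
                    state[v] = IN_Q

    def worker():
        while True:
            u = None
            with q_lock:
                if queue:
                    u = queue.popleft()
                    state[u] = WAS_IN_Q
                    busy[0] += 1
                elif busy[0] == 0:
                    return              # Q is empty and no thread can refill it
            if u is None:
                time.sleep(0)           # yield; a busy thread may still insert nodes
                continue
            block_b(u)
            with q_lock:
                busy[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(k - 1)]
    for t in threads:
        t.start()
    worker()                            # the calling thread does the MASTER's share
    for t in threads:
        t.join()
    return dist, pred


# Small illustrative network (not from the paper).
arcs = {1: [(2, 4), (3, 1)], 2: [(4, 5)], 3: [(2, 2), (4, 8)], 4: [(5, 3)]}
dist, pred = parallel_pdm(range(1, 6), arcs, source=1, k=3)
```

As in Block B, the label lock is released before the queue lock is taken, so no thread ever holds both; a thread stops only when the queue is empty and no other thread is still fanning out, which plays the role of the MASTER's termination test.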


Proof of correctness 


We will now informally prove the correctness of this algorithm. It is
easy to see that the initialization step is correct. For the
path-finding step, we will first state and prove six remarks to show
that the algorithm terminates for all networks which have no negative
cycles.

Remark 1: For any node v, D[v] is nonincreasing with time.

Remark 2: Each finite D[v] represents the length of a path from
          SOURCE to v.

Remark 3: Only a finite number of insertions are made into Q.

Remark 4: Every execution of Block B always terminates.

Remark 5: There exists a time after which the MASTER process will not
          execute Block B and MSYN = "yes" holds.

Remark 6: Algorithm PPDM terminates.

To see that D[v] is nonincreasing, one simply observes that D[v] only
changes when it is locked, and the changes are always decrements. To
see that each finite entry D[v] represents a path length, we use
induction on the time sequence of the changes to the array D[·]. Let
$t_1$ be the time immediately after D[SOURCE] is initialized to zero,
and let $t_{i+1}$ be the time immediately after the first change (or
changes) in D[·] after $t_i$, for i = 1, 2, .... At time $t_1$,
D[SOURCE] = 0 is the only entry of D[·] with a finite value, and 0 is
the path length of the null path from SOURCE to SOURCE. Suppose for
all time $t \le t_i$ each finite D[v] represents a path length from
SOURCE to v, and suppose D[v] is changed immediately before
$t_{i+1}$. Assume that the change in D[v] is caused as we fan out
from u, and that the value of D[u], at the time of its reading at
statement 3 in Block B, is the path length of
$(\mathrm{SOURCE} = u_0, u_1, \ldots, u_j = u)$. At time $t_{i+1}$,
D[v] is the path length of $(u_0, u_1, \ldots, u_j, v)$. Thus,
Remark 2 follows by induction.

To see that Remark 3 holds, we first notice 
that each D[v] is bounded from below, because the 
D[v]'s represent path lengths and the input 
network has no negative cycles. Secondly, we 
notice that there are only finitely many decrements 
to the D[v]'s, because each decrement decreases a 
D[v] by at least the minimum length difference 
between two loopless paths. Thus Remark 3 follows, 
since each insertion into Q implies a previous 
decrement of a D[v]. 

We will prove Remarks 4 and 5 together. To 
prove Remark 4, it suffices to show that no 
indefinite waits occur at Block B's statements 3, 4, 
and 11. By Remark 3, we see that Block B can be 
executed for only finitely many times. Thus every 
waiting at statements 3 and 4 takes a finite time. 
Because Q can be locked outside Block B, more 
arguments are needed to show that no indefinite wait 
occurs at Block B's "lock Q" statement (statement 
11). We will prove a stronger result that no 
indefinite wait can occur at any "lock Q" statement 
in Algorithm PPDM. The MASTER always sets MSYN 
to "yes" before it executes "lock Q", and when 
MSYN is "yes" all WORKERs will be blocked from 
entering statements 4 to 11 and Block B. Thus the 
MASTER has no indefinite wait at "lock Q", and that 
its executions of Block B take finite time. Before 
we prove similar results for the WORKERs, we first 
prove Remark 5. It is easy to see that the loop of
the MASTER's statements 9 to 16 has no indefinite 
wait. We claim that the loop of statements 9, 10, 17, 
18, and 19 has no indefinite wait also, for if the 
MASTER is waiting at statement 17, then MSYN
would have the value "yes", and consequently, only
finitely many short lockings of WAIT can occur at
the WORKERs' statement 12. Since indefinite wait
does not occur at the MASTER process, and there 
are only finitely many insertions into Q, we conclude 
that eventually the MASTER will never enter Block 
B. We have just proved Remark 5. To finish the 
proof of Remark 4, we assert that the WORKERs 
have finite waiting time for executing the "lock Q" 
statements. Suppose the converse is true, and j 
WORKERs are waiting indefinitely at the "lock Q" 
statements (i.e. WORKER's statement 4 or Block B's 
statement 11). By Remark 5, the MASTER will 
eventually be looping at statements 9, 10, 17, 18, 
and 19. Each time the MASTER executes "unlock Q",
statement 18, one of the j waiting WORKERs is
allowed to finish executing "lock Q", which is a
contradiction. 

To prove Remark 6, we first recall that every 
execution of "lock Q" takes a finite waiting time. 
From Remark 3, we see that Q will eventually be 
empty and WORKER will not execute statements 6 
to 9. By Remark 5, MSYN eventually has the value 
"yes", therefore all WORKERs are directed to the 
loop of statements 14 and 15. Consequently, 
Algorithm PPDM terminates. 


Now we prove the correctness of the outputs,
D[*] and P[*]. We use $D_t[u]$ and $P_t[u]$ to denote
the values of D[u] and P[u] at time t, and use z to
denote the termination time. We first claim that
$D_z[v] \le D_z[u] + \ell(u, v)$, for each arc (u, v).
Suppose $(u_1, v_1)$ is an arc of the input network.
Let a be the time of the last deletion of $u_1$ from Q.
Consequently, Block B is executed for $u_1$ after time
a. The processing of the arc $(u_1, v_1)$ includes the
execution of either statements 5 and 6, or
statements 5, 8, 9, and 10. Let b be the time of the
execution of "unlock D[v]", at statement 6 or 10.
Since the last deletion of $u_1$ occurs at a, it is easy
to see that $D[u_1]$ stays constant after time a.
Consequently,

$$D_z[v_1] \le D_b[v_1] \le D_z[u_1] + \ell(u_1, v_1).$$

Having proved $D[v] \le D[u] + \ell(u, v)$ for
all arcs (u, v), we conclude that the D[u]'s are the
shortest distances by the same argument that was
used for the proof of correctness of Algorithm PDM.
To prove that for each u,

$$(\mathrm{SOURCE}, \ldots, P_z[P_z[u]], P_z[u], u)$$

is a shortest path, it suffices to show that for each
$v_1$, if $u_1 = P_z[v_1]$ then
$D_z[v_1] = D_z[u_1] + \ell(u_1, v_1)$, for it says that a
shortest path from SOURCE to $u_1$ concatenated with
$(u_1, v_1)$ forms a shortest path from SOURCE to $v_1$.
Let time a and time b be defined as before. It is
easy to see that $D[v_1]$ is decreased in that
execution of Block B, and so
$D_b[v_1] = D_z[u_1] + \ell(u_1, v_1)$. Finally, we see that
$D_z[v_1] = D_b[v_1]$, because any change of $D[v_1]$
after time b implies a change in $P_z[v_1] = u_1$. This
completes the proof of correctness of Algorithm
PPDM.


Algorithm PDM and Algorithm PPDM were coded
to run on the HEP computer. The programs use a
linked queue, which is used in Pape [15], and Dial,
Glover, Karney, and Klingman [6]. The input
network is stored in a linked list structure called
the forward star form, used also in [6]. Timing
experiments were performed with randomly
generated connected networks. Following the
characteristics of the Eastern Washington Highway
Network, the generated networks were assigned
exponentially distributed arc lengths, and
approximately 35% of their nodes have an outdegree
of one, 9% an outdegree of two, 40% an outdegree of
three, and 16% an outdegree of four.
Highway networks usually have all two-way roads,
and so do the generated networks. For each NODES =
10, 25, 50, 75, 100, we generated two networks.
For each network, we picked five source nodes.
Each of these 100 problems is solved with the
sequential Algorithm PDM, and the parallel version,
Algorithm PPDM, with the number of processors K =
1 to 8. Let $T_1$ denote the solution time for the
sequential algorithm, and $T_K$ denote the solution
time with the K-processor, parallel algorithm. For
each problem, the speed-up, $S_K = T_1/T_K$, and the
efficiency, $E_K = S_K/K$, are computed. For fixed
NODES and K, the averages of the $S_K$'s and $E_K$'s are
plotted in Figure 1 and Figure 2, respectively. For
NODES = 75 and 100, we see that a speed-up of
approximately three is achieved with five
processors, and thus an approximate efficiency of
60%. However, regardless of the number of
processors used, we expect that Algorithm PPDM
has a constant upper bound on its speed-up,
because every process demands private use of the
queue Q.


4. A PARALLEL ALGORITHM FOR THE ALL-TO-ALL
SHORTEST PATH PROBLEM


The best known algorithm for determining
shortest paths between all pairs of nodes is due to
Floyd [10], which in turn is based on an earlier
algorithm for transitive closure proposed by
Warshall [4].

The basic idea of the algorithm may be
expressed as follows:


Algorithm F
1 for k := 1 step 1 until NODES do
2   for i := 1 step 1 until NODES do
3     for j := 1 step 1 until NODES do
4       if D[i, j] > D[i, k] + D[k, j] then
5         D[i, j] := D[i, k] + D[k, j]

The matrix, D[*], is initialized to be the arc
length matrix. If the input network contains no
negative cycle, element D[i, j] at the termination is
the shortest distance from i to j, because at the
end of the kth iteration, D[i, j] is updated to be the
shortest distance from i to j via paths whose
intermediate nodes are contained in {1, ..., k}.
We will show that the inner loops of this
algorithm may be computed in parallel as follows:


Algorithm PF (Parallel Floyd)

1 for k := 1 step 1 until NODES do
2   for 1 ≤ i, j ≤ NODES do simultaneously
3     if D[i, j] > D[i, k] + D[k, j] then
4       D[i, j] := D[i, k] + D[k, j]
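As an illustration of the claim that the inner loops may run in any order, the following minimal Python sketch compares Algorithm F with a variant that, for each fixed k, visits the (i, j) pairs in a shuffled order. The function names `floyd` and `floyd_any_order` and the use of a random shuffle are assumptions made for this sketch; a shuffle only imitates an arbitrary execution sequence and is not true parallel execution.

```python
import random
from copy import deepcopy

INF = float("inf")


def floyd(dist):
    """Sequential Algorithm F on a square arc-length matrix (0-indexed)."""
    d = deepcopy(dist)
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][j] > d[i][k] + d[k][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def floyd_any_order(dist, seed=0):
    """Same updates, but for each fixed k the (i, j) pairs are visited in
    a shuffled order, imitating one possible interleaving of Algorithm PF."""
    rng = random.Random(seed)
    d = deepcopy(dist)
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    for k in range(n):
        rng.shuffle(pairs)
        for i, j in pairs:
            if d[i][j] > d[i][k] + d[k][j]:
                d[i][j] = d[i][k] + d[k][j]
    return d
```

For an input with no negative cycle, both functions should return the same matrix for every seed, which is the behaviour the correctness argument for Algorithm PF establishes.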


To prove that Algorithm PF is correct, we use 
the theory developed for controlling concurrent 
processes in operating systems. In particular, we 
use the definition and results in Chapter 2 of [2]. 

We first informally review some definitions. A
task system $C = (\tau, <)$ is a set of tasks,
$\tau = \{T_1, T_2, \ldots, T_n\}$, together with a precedence
relation, <, where T < T' means that T must be
completed before T' begins. Any execution
sequence of C must obey the precedence relation.
Each task T is associated with two subsets, the
domain $D_T$ and the range $R_T$, of the memory cells.
When T starts it reads values from its domain, and
when T terminates it writes values into its range. T
and T' are noninterfering if either T < T', or T' < T,
or $R_T \cap R_{T'} = R_T \cap D_{T'} = D_T \cap R_{T'} = \emptyset$.
Tasks $\{T_1, \ldots, T_n\}$ are mutually noninterfering if
every pair of tasks $T_i$ and $T_j$ ($i \ne j$) are
noninterfering. We will use the following theorem
which is stated and proved in [2], pp. 39-40.

Theorem: Task systems consisting of mutually
noninterfering tasks are determinate.


The definition of determinacy of task systems
requires a long development, [2], pp. 35-38, which
we will not review here. For the purpose of proving
the correctness of Algorithm PF, it suffices to
note that determinacy of a task system implies that
for the same initial memory state, any execution 
sequence of the task system will end up with the 
same final memory state. We will define a set of task 
systems, and prove that each of them contains 
mutually noninterfering tasks. Then, we will use 
the above theorem to conclude that Algorithm F and 
Algorithm PF compute identical results. 

For each $1 \le i, j, k \le$ NODES, let $T_{kij}$
denote the task

"if D[i, j] > D[i, k] + D[k, j] then
D[i, j] := D[i, k] + D[k, j]".

For each k = 1, ..., NODES, define task system
$C_k = (\tau_k, \emptyset)$, where task set
$\tau_k = \{T_{kij} \mid 1 \le i, j \le \mathrm{NODES}\}$ and
$\emptyset$ is the null precedence relation, i.e.
no task needs to precede any other task. We will
now show that each $C_k$ contains mutually
noninterfering tasks, and thus conclude that every
execution sequence of $C_k$ produces the same result
as Algorithm F's execution sequence does. We will
use $M_{ij}$ to denote the memory cell for the variable
D[i, j]. $M_{ij} = M_{ab}$ if and only if i = a and j = b. We
will use $D_{kij}$ and $R_{kij}$ to denote the domain and
range of task $T_{kij}$.

Remark 7: (a) $D_{kij} = \{M_{ij}, M_{ik}, M_{kj}\}$.

(b) $R_{kij} \subseteq \{M_{ij}\}$.

(c) If the input network has no
negative cycle, then $R_{kkj} = R_{kik} = \emptyset$.


Parts (a) and (b) follow immediately from the
definitions of domain and range of a task. For
part (c), $T_{kkj}$ contains the test "D[k, j] > D[k, k]
+ D[k, j]". Since the network has no negative
cycle, D[k, k] is nonnegative. Thus the test result
is always false, and the content of $M_{kj}$ will not be
changed. $R_{kkj} = \emptyset$ follows. Similarly,
$R_{kik} = \emptyset$ also follows.
Remark 8: If the input network has no negative
cycle, then $\tau_k$ contains mutually
noninterfering tasks.


Because there are no precedence constraints
between tasks in $\tau_k$, we need to prove that
$R_{kij} \cap R_{kab} = R_{kij} \cap D_{kab} = D_{kij} \cap R_{kab} = \emptyset$,
for all $(i, j) \ne (a, b)$.
$R_{kij} \cap R_{kab} \subseteq \{M_{ij}\} \cap \{M_{ab}\} = \emptyset$,
because $(i, j) \ne (a, b)$.
$R_{kij} \cap D_{kab} \subseteq \{M_{ij}\} \cap \{M_{ab}, M_{ak}, M_{kb}\} = \emptyset$,
for $(i, j) \ne (a, b)$, $j \ne k$, and $i \ne k$; when
$i = k$ or $j = k$, $R_{kij}$ is empty by Remark 7(c).
Similarly $D_{kij} \cap R_{kab} = \emptyset$. It follows that $\tau_k$
contains mutually noninterfering tasks, for k = 1,
..., NODES. As noted before, this implies that
Algorithm PF is correct.
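The noninterference condition used in this argument can be checked mechanically for a small network. The sketch below is only illustrative: it assumes each task reads the cells for D[i, j], D[i, k] and D[k, j] and writes at most the cell for D[i, j], and it encodes Remark 7(c) through a hypothetical flag `assume_no_negative_cycle`. The function names are ours, not the paper's.

```python
from itertools import combinations


def domain(k, i, j):
    """Memory cells read by the task for (k, i, j); cell M_ij is modelled as (i, j)."""
    return {(i, j), (i, k), (k, j)}


def task_range(k, i, j, assume_no_negative_cycle=True):
    """Cells the task may write. With no negative cycle, tasks on row k or
    column k never change their cell (Remark 7(c)), so their range is empty."""
    if assume_no_negative_cycle and (i == k or j == k):
        return set()
    return {(i, j)}


def mutually_noninterfering(k, nodes, assume_no_negative_cycle=True):
    """Check R∩R' = R∩D' = D∩R' = ∅ for every pair of distinct tasks at step k."""
    tasks = [(i, j) for i in range(1, nodes + 1) for j in range(1, nodes + 1)]
    for (i, j), (a, b) in combinations(tasks, 2):
        r1 = task_range(k, i, j, assume_no_negative_cycle)
        r2 = task_range(k, a, b, assume_no_negative_cycle)
        d1, d2 = domain(k, i, j), domain(k, a, b)
        if r1 & r2 or r1 & d2 or d1 & r2:
            return False
    return True
```

Dropping the no-negative-cycle assumption makes the check fail, because a task on row k could then write a cell that other tasks read; this is exactly where Remark 7(c) is needed.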

Algorithm PF is programmed to run on the HEP
computer. The number of processes created is
minimized in order to reduce the overhead (of the
create operation). The logic of our program,
referred to as Algorithm HEPPF (HEP parallel Floyd),
is as follows:


Algorithm HEPPF

Process MASTER
1 SYN := 0;
2 for ℓ := 1 step 1 until K-1 do
3   create WORKER(ℓ);
4 execute WORKER(K)

Process WORKER(ℓ)
1 for k := 1 step 1 until NODES do
2 begin
3   for i := ℓ step K until NODES do
4     if D[i, k] < ∞ then
5       for j := 1 step 1 until NODES do
6         execute $T_{kij}$;
7   lock SYN; SYN := SYN + 1; unlock SYN;
8   L1: if SYN < K * k then go to L1
9 end
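The structure of Algorithm HEPPF can be imitated with ordinary threads. The following Python sketch is a rough analogue under stated assumptions: Python threads stand in for HEP processes, a `threading.Lock` stands in for "lock SYN", indices are 0-based (so worker ids run from 0 to K-1), and the final `join` calls are an addition of this sketch. It illustrates the control flow only and says nothing about HEP performance.

```python
import threading
import time

INF = float("inf")


def heppf(dist, num_workers):
    """Thread-based analogue of Algorithm HEPPF on a square matrix."""
    d = [row[:] for row in dist]
    n = len(d)
    syn = [0]                      # shared counter SYN
    syn_lock = threading.Lock()    # stands in for "lock SYN"

    def worker(ell):
        for k in range(n):
            # Rows ell, ell + K, ell + 2K, ... belong to this worker.
            for i in range(ell, n, num_workers):
                if d[i][k] < INF:
                    for j in range(n):
                        via = d[i][k] + d[k][j]
                        if d[i][j] > via:
                            d[i][j] = via
            with syn_lock:
                syn[0] += 1
            # Spin until every worker has finished iteration k.
            while syn[0] < num_workers * (k + 1):
                time.sleep(0)

    threads = [threading.Thread(target=worker, args=(ell,))
               for ell in range(num_workers - 1)]
    for t in threads:
        t.start()
    worker(num_workers - 1)        # the MASTER acts as the last worker
    for t in threads:
        t.join()
    return d
```

The counter-based wait mirrors statements 7 and 8: no worker starts iteration k+1 until SYN shows that all K workers have completed iteration k.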
Algorithm HEPPF was coded and run for the
experimental timing study. Experiments used
randomly generated 20-, 30-, and 40-node
networks. NODES x NODES arc length matrices
with different densities of non-infinity entries
distributed uniformly from 0 to 99 were generated.
The results of our timing study are shown in
Table 1. Let $T_K$ denote the experimental running
time of the algorithm with K processors. Let $S_K$ and
$E_K$ denote the speed-up, $T_1/T_K$, and efficiency,
$S_K/K$, respectively. The efficiency of this
algorithm for networks with 40, 30, and 20 nodes is
plotted in Figures 3, 4, and 5. It is evident that
the efficiency tends to be high when the number of
nodes in the network is a multiple of K, the number
of processors. In such a case, each WORKER
process does exactly the same amount of work, but
in the case where K does not divide NODES exactly,
the WORKERs do not all do the same amount of
processing. For example, for each K, WORKER(1)
performs $\lceil \mathrm{NODES}/K \rceil$ executions of statements 4 to
6, but WORKER(K) performs $\lfloor \mathrm{NODES}/K \rfloor$ executions
of statements 4 to 6. The WORKERs which finish
their work earlier must wait for all others before
starting on the next iteration. Thus the theoretical
speed-up should be approximately
$\mathrm{NODES}/\lceil \mathrm{NODES}/K \rceil$. More precisely, if we let $t_1$
denote the time for executing one iteration of the for
loop in statement 3 of procedure WORKER, and $t_0$
denote the time for executing statements 1,
and 10 once, then the theoretical speed-up is

$$TS_K = \frac{\mathrm{NODES} \cdot t_1 + t_0}{\lceil \mathrm{NODES}/K \rceil \cdot t_1 + t_0} = \frac{\mathrm{NODES} + t_0/t_1}{\lceil \mathrm{NODES}/K \rceil + t_0/t_1}.$$

For our compiled code of Algorithm HEPPF, $t_0/t_1$ is
estimated to be approximately 1/(2 NODES + 1). Using
this estimate, the ratio

$$\frac{\text{observed efficiency } E_K}{\text{theoretical efficiency } TS_K/K}$$

is calculated and plotted in Figure 6. From this plot
we observe that the overhead for the create and the
synchronization is relatively small when the input
network is dense.
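The speed-up and efficiency quantities discussed here are simple to evaluate. The sketch below is a minimal illustration, assuming the theoretical speed-up has the form (NODES + r)/(ceil(NODES/K) + r) with r = t_0/t_1 estimated as 1/(2 NODES + 1); the function names are ours. The two running times in the usage note are the K = 1 and K = 8 entries for NODES = 40 at 100% density in Table 1.

```python
def theoretical_speedup(nodes, k, overhead_ratio=None):
    """Theoretical speed-up for K workers; overhead_ratio is t_0 / t_1."""
    if overhead_ratio is None:
        overhead_ratio = 1.0 / (2 * nodes + 1)
    rows_per_worker = -(-nodes // k)   # integer ceiling of nodes / k
    return (nodes + overhead_ratio) / (rows_per_worker + overhead_ratio)


def observed_efficiency(t1, tk, k):
    """Efficiency E_K = S_K / K, with speed-up S_K = T_1 / T_K."""
    return (t1 / tk) / k


def efficiency_ratio(t1, tk, nodes, k):
    """Observed efficiency divided by theoretical efficiency TS_K / K."""
    return observed_efficiency(t1, tk, k) / (theoretical_speedup(nodes, k) / k)
```

For example, `observed_efficiency(1.30478, 0.16693, 8)` gives the measured efficiency for the dense 40-node network on eight processors, and `theoretical_speedup(40, 7)` is lower than `theoretical_speedup(40, 8)` relative to K because 7 does not divide 40.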


5. CONCLUSION


Two parallel shortest-path algorithms are 
designed and proved correct in this paper. They 
were both programmed to run on the HEP computer. 


For the first algorithm, i.e. Algorithm PPDM, 
random highway-like sparse networks were 
generated and used as inputs. We observed 


empirically a speed-up of three when five processors 
were employed, for networks with 75 or more nodes. 
For the second algorithm, i.e. Algorithm HEPPF, 
random arc-length matrices of order up to 40 were 
generated and used as inputs. We found that the 
efficiency is higher for larger and denser networks. 
Thus we have clearly demonstrated theoretically as 
well as empirically that parallel processing 
techniques can be used profitably to speed up 
determination of shortest paths in large networks. 
We have also shown how this can be accomplished.


Table 1. Running time of Algorithm HEPPF (in secs).

NODES = 40       Density
Processors   100%      50%       25%       12.5%
    1        1.30478   1.24866   1.13903   0.88217
    2        0.65522   0.63133   0.58283   0.46305
    3        0.45726   0.44399   0.40812   0.32185
    4        0.32989   0.32097   0.29727   0.25366
    5        0.26484   0.25992   0.24512   0.21071
    6        0.23169   0.22906   0.21123   0.17719
    7        0.19889   0.19627   0.18433   0.15915
    8        0.16693   0.16594   0.15423   0.13571

NODES = 30       Density
Processors   100%      75%       50%       25%
    1        0.55024   0.53037   0.49828   0.45644
    2        0.27684   0.27116   0.25537   0.23737
    3        0.18544   0.18088   0.17221   0.15966
    4        0.14774   0.14519   0.13785   0.12816
    5        0.11213   0.11039   0.10760   0.09756
    6        0.09417   0.09429   0.08958   0.08582
    7        0.09294   0.08973   0.08699   0.08280
    8        0.07550   0.07559   0.07361   0.06762

NODES = 20       Density
Processors   100%      75%       50%       25%
    1        0.16299   0.15615   0.14249   0.11844
    2        0.08213   0.08028   0.07291   0.06457
    3        0.05753   0.05683   0.05195   0.04626
    4        0.04165   0.04086   0.03888   0.03528
    5        0.03348   0.03304   0.03118   0.02770
    6        0.03317   0.03287   0.03016   0.02767
    7        0.02533   0.02541   0.02503   0.02292
    8        0.02513   0.02479   0.02401   0.02166


A PARTITION ALGORITHM FOR PARALLEL AND DISTRIBUTED PROCESSING *


Shyue B. Wu and Ming T. Liu

Summary


An efficient partition algorithm can be 
applied to solve problems in assignment of tasks 
and resources [5] as well as problems in
scheduling and control of distributed processes 
[3]. Successful solutions to these problems can 
increase system performance and reliability. This 
paper presents an efficient partition algorithm 
and discusses its use in solving the assignment 
and scheduling problems for parallel and 
distributed processing in large multi- 
microcomputer systems. 


A general case of the partition problem can
be stated as follows: given a graph, G=(V,L),
where V is a set of nodes and L is a set of links,
each associated with a positive number
representing the weight (which in turn represents
a communication or execution cost) of the link;
we are to partition the graph into K disjoint
nonempty subgraphs in such a way that the sum
(called partition cost) of the weights of the
links which separate the subgraphs is minimized.


An efficient solution to the partition
problem for K=2 can be directly obtained by
using any of several available network flow
algorithms [1] [3]. However, for K>2, the problem
has been known to be NP-complete. With the
introduction of microcomputers, distributed
systems with more than two processors are becoming
more common. Therefore, it is important to obtain an
efficient algorithm, applicable to K-processor
(K>2) systems, for the partition problem. We will
show how such an algorithm can be obtained from
the use of network flow algorithms.


In order to conform with terminology used in 
network flow theory, we shall henceforth use the 
terms networks and subnetworks rather than graphs 
and subgraphs in this paper. 


A K-cut of a network is a minimum set of
links, the removal of which separates the network
into K disjoint nonempty subnetworks. The cost of
a K-cut is the sum of the weights of the links in
the K-cut.


* Research reported herein was supported in part
by the NSF under grant MCS-77-23496.


From the above, we can see that the partition
problem in general cases is equivalent to finding
an optimal K-cut (i.e., a K-cut whose cost is
minimum among all possible K-cuts). Thus our
algorithm for the partition problem can be stated
as follows:

Partition Algorithm
   i <-- 2;
   Obtain a 2-cut of the given network by using a
      network flow algorithm;
   Do while (i < K);
      Obtain a 2-cut of each of the two subnetworks
         resulting from the cut previously selected
         by using a network flow algorithm;
      Pick up the cut whose cost is minimum among
         all unselected 2-cuts obtained so far;
      The (i+1)-cut is equivalent to the i-cut plus
         the selected cut;
      i <-- i+1;
   End;


Our partition algorithm is efficient in the
sense that it uses a network flow algorithm only
in the order of K (O(K)) times. Our algorithm is
also good in the sense that it yields a good
solution. The following two theorems are stated
to show that our algorithm results in a solution
with minimum cost if the given network is
tree-like. The empirical results are also presented
below to show that our algorithm results in a
solution with near minimum cost in general cases.
For the proof of the theorems and the detail of
the performance studies, readers are referred to
[5].


Theorem 1: For a tree-like network, if two nodes
are in the same subnetwork of an optimal K-cut
(K>2), then these two nodes are in the same
subnetwork of an optimal (K-i)-cut (1<=i<=K-2).


Theorem 2: For a tree-like network, any optimal
(K-1)-cut is a subset of an optimal K-cut.
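The greedy structure of the partition algorithm can be sketched in a few lines of Python. This is an illustrative sketch only: a brute-force search over bipartitions stands in for the network flow algorithm (so it is suitable only for very small networks), 2-cuts are recomputed each round rather than remembered, and the names `min_two_cut` and `partition` are hypothetical.

```python
from itertools import combinations


def cut_cost(weights, side_a, side_b):
    """Sum of weights of links with one end in side_a and the other in side_b."""
    return sum(w for (u, v), w in weights.items()
               if (u in side_a and v in side_b) or (u in side_b and v in side_a))


def min_two_cut(nodes, weights):
    """Cheapest split of `nodes` into two nonempty parts (brute force).
    Returns (cost, side_a, side_b), or None if fewer than two nodes."""
    nodes = sorted(nodes)
    if len(nodes) < 2:
        return None
    first, rest = nodes[0], nodes[1:]
    best = None
    for r in range(len(rest)):               # side_a never holds every node
        for extra in combinations(rest, r):
            side_a = frozenset((first,) + extra)
            side_b = frozenset(nodes) - side_a
            cost = cut_cost(weights, side_a, side_b)
            if best is None or cost < best[0]:
                best = (cost, side_a, side_b)
    return best


def partition(nodes, weights, k):
    """Greedy K-way partition: repeatedly apply the cheapest available 2-cut."""
    parts = [frozenset(nodes)]
    total = 0
    while len(parts) < k:
        candidates = []
        for idx, part in enumerate(parts):
            best = min_two_cut(part, weights)
            if best is not None:
                candidates.append((best[0], idx, best[1], best[2]))
        if not candidates:
            raise ValueError("k exceeds the number of nodes")
        cost, idx, side_a, side_b = min(candidates, key=lambda c: c[0])
        parts[idx:idx + 1] = [side_a, side_b]
        total += cost
    return total, parts
```

On a simple chain of four nodes with link weights 5, 1 and 4, `partition` with K = 3 first removes the weight-1 link and then the weight-4 link, for a partition cost of 5.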


Our partition algorithm has been programmed
and tested on 720 randomly generated networks, each
with six nodes. For each test network, we
collected the error percentage of an i-cut (E(i)),
which is defined by (SUBOPT - OPT)/OPT, where
SUBOPT is the cost of the i-cut obtained by using
our algorithm and OPT is the cost of the optimal
i-cut obtained by an exhaustive enumeration
method.


The distribution of E(i) is given in Table 1,
where NOPT represents the number of networks with
an optimal solution (no error), 1% represents the
number of networks with an error between 0% and 1%
(0% < E(i) <= 1%), and so on. For example, 10 of
the 720 networks for which we obtained a 4-cut
had solutions with errors between 3% and 4%.


Table 1: Distribution of Error Percentage

i-cut   NOPT    1%    2%    3%    4%    5%   MORE
  3      696     0     0     4     1     0     19
  4      655     7     4    13    10    13     18
  5      581    42    43    11    15    14     14


From Table 1, we see that our partition
algorithm obtains an optimal solution about 90%
(1932/2160) of the time, and obtains a solution
with more than 5% of error only about 2.5%
(51/2160) of the time. The average E(i), which is
not shown in Table 1, is approximately in the
order of 10 ** -3 (0.1%). Thus we feel that our
partition algorithm is good enough for general
applications.
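The totals quoted from Table 1 can be recomputed directly from its rows. The short Python sketch below only re-adds the published counts; the variable names are ours.

```python
# Rows of Table 1: counts for NOPT, 1%, 2%, 3%, 4%, 5%, MORE.
error_distribution = {
    3: [696, 0, 0, 4, 1, 0, 19],
    4: [655, 7, 4, 13, 10, 13, 18],
    5: [581, 42, 43, 11, 15, 14, 14],
}

total_cases = sum(sum(row) for row in error_distribution.values())
optimal_cases = sum(row[0] for row in error_distribution.values())
large_error_cases = sum(row[-1] for row in error_distribution.values())

optimal_fraction = optimal_cases / total_cases          # share solved optimally
large_error_fraction = large_error_cases / total_cases  # share with error above 5%
```

Each row sums to the 720 test networks, and the two fractions correspond to the "about 90%" and "about 2.5%" figures quoted in the text.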

With the introduction of microcomputers,
there has been a great interest in constructing a
distributed system from a large number of
microcomputers [2]. In such a system, the memory
of each processor is restricted [2]. Therefore,
there is a need to distribute (or assign) system
software resources, such as modules of operating
systems, over (to) the processors in the system;
this is called resource assignment or resource
allocation.

In the following, we outline the use of our
algorithm to solve the resource assignment problem
in large multi-microcomputer systems [4], and
discuss the impact of the solution.

The resource assignment problem in large
multi-microcomputer systems can be stated as
follows: 1) given a module network, G=(M,R),
where M is a set of modules (or software
resources) and R is a set of links, each of which
(Rij) is associated with a link weight
representing the communication cost per unit
distance between two modules (Mi and Mj); 2)
given a system node network, G=(P,D), where P is a
set of processors and D is a set of links, each of
which (Dij) is associated with a link weight
representing the number of unit distances between
two processors (Pi and Pj); and 3) we are to find
a mapping function f : M <--> P such that the total
cost given by $S = \sum R_{ij} \cdot D_{f(i)f(j)}$ is minimized.
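The total cost of a given mapping can be evaluated directly from the two networks. The Python sketch below is a minimal illustration; the dictionary layouts for the module weights, the processor distances, and the mapping f are assumptions made for the example, not part of the paper.

```python
def assignment_cost(module_weights, distances, mapping):
    """Total cost S: sum over module links of R_ij * D_f(i)f(j).

    module_weights: {(module_i, module_j): R_ij}
    distances:      {processor_p: {processor_q: D_pq}}
    mapping:        {module: processor}, the function f
    """
    return sum(weight * distances[mapping[mi]][mapping[mj]]
               for (mi, mj), weight in module_weights.items())
```

For instance, with link weights 4 between modules a and b and 1 between b and c, and two processors two unit distances apart, placing a and b together costs 2, while separating a from b costs 8.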


We can use the partition algorithm to obtain
a K-cut of a module network and that of a system
node network. The subnetworks of the module
network after the K-cut can then be assigned to
the corresponding subnetwork of the system node
network. Therefore, we can assign system software
resources to system nodes through the use of our
partition algorithm.


A better resource assignment can result in
less communication cost when a set of software
resources are requested for service. Therefore, a
better resource assignment can eliminate
unnecessary message traffic in a system, thereby
minimizing interconnection limitation which arises
due to message traffic saturation. The minimization
of interconnection limitation, in terms of
the number of nodes that can be interconnected and
the amount of message traffic that can be
supported, may make it possible to use large
multi-microcomputer systems for a variety of
applications [4].


A better resource assignment can also make
the task assignment easier in large
multi-microcomputer systems. The task assignment
problem is how to assign program modules to system
nodes (processors) so as to minimize the total
execution and communication costs. In large
multi-microcomputer systems, each node may be
dedicated to provide a specific function. A task
requesting a specific resource is likely to be
assigned to the node providing this specific
resource. Therefore, the task assignment can be
easily achieved in this case.

For task assignment, if there is an
overloaded node, the program modules assigned to
this node can be reassigned to other nodes as long
as the extra cost introduced by the reassignment
is paid for. The decision of reassignment can be
made by using our algorithm to partition the
reassignment network [3] to see whether the
reassignment is worthwhile. Through the use of
reassignment of tasks, it is likely that
scheduling and control of distributed processes in
large multi-microcomputer systems can be improved.


Abstract -- In this paper we describe a tree-structured
machine, suitable for VLSI implementation, that handles all the
frequently encountered database operations efficiently. N
elements are maintained on an N-processor version of the tree
machine. We shall describe algorithms, based on a new
concept of associative search, for insertion and deletion of
elements in the tree. The tree machine can handle a large
class of searching problems. Insertion, deletion, queries, and
updates can all be processed in O(log N) time units. It is
especially suitable when a sequence of such operations is to
be processed in a pipelined fashion. I/O time dominates the
total time to execute more complex operations such as join of
two relations or sorting. Once data are in the machine, it
usually takes O(log N) time units for the first results to emerge.
Therefore it is very suitable for on-line systems where fast
response time is needed. Some major obstacles to be
overcome are discussed.


1. Introduction 

Database management systems are concerned with the 
task of providing fast retrieval, storage and update operations, 
in response to users’ requests. In recent years, database 
systems have been growing in size and software systems to
manage them are becoming increasingly more sophisticated 
and complex. Also, as demand for services increases, many 
data processing installations have reached the point of 
saturation. Backend database systems have been proposed 
as a solution to the problem of overloaded installations. The 
reader is referred to [17] for a discussion on such systems. 


This research was supported in part by the Defense Advanced
Research Projects Agency under Contract F33615-78-C-1551
(monitored by the Air Force Office of Scientific Research), in part by
the National Science Foundation under Grant MCS 78-236-76, and in
part by the Office of Naval Research under Contracts
N00014-76-C-0370 and N00014-80-C-0236. The author is
supported in part by Fundacao de Amparo a Pesquisa do Estado de
Sao Paulo under Grant 76/517, and in part by the Institute of
Mathematics and Statistics of the University of Sao Paulo, Brazil.


Various design efforts of specialized hardware with novel
architectures to handle database problems have been carried
out [1], [7], [14], [20], [25]. There are abundant literature and
survey articles on these designs [11], [13], [22], [24].


Very Large Scale Integrated circuitry has been increasing in
speed and density at an amazing rate. The number of
components on a single chip is claimed to reach several
million by the end of this decade [18]. This has aroused a
surge of interest in developing customized designs of
algorithms implementable on silicon. Leiserson [15] proposes
a systolic priority queue, a structure with the possible
operations of insertion, deletion and minimum extraction. The
rebound sorter of Chen et al. [8] handles sorting problems.
Bentley and Kung [3] present a design of a tree machine for
searching problems. Kung and Lehman [12] propose several
linear arrays of processors capable of performing such
operations as intersection, join and duplicate removal. These
designs provide efficient handling of specific tasks. In
database applications, some queries require the execution of
a sequence of database operations before the answer is
obtained. It is therefore desirable to have a single
special-purpose device which can provide efficient solutions
to all basic database operations such as search, insertion,
deletion, updates, sort, join, union, etc. For this purpose we
have chosen the tree machine of Bentley and Kung [3], which
can solve all of the "decomposable searching problems" [4],
and attempted to extend it to handle other basic operations.
Tree-structured machines have been proposed to handle
other types of problems. The designs by Berkling [5], Mago
[16], Sequin, Despain, and Patterson [21] and Wilner [26] are
general-purpose computing devices. Hollaar [10] presents a
design for merging sorted lists. Browning [6] considers
several applications such as sorting and NP-complete problems.


2. General System Configuration

In Figure 1 we show the tree machine acting as a back-end
machine to a host computer. Users' database manipulation
commands are passed on to the tree controller which, using
some auxiliary information, will locate where in the mass
storage the needed information resides. Data clustering is an
important issue and is discussed in [2]. The tree controller
will then command the I/O controller to transfer data to the
tree machine. Loading of the tree machine will be the
bottleneck of the system and will be discussed later in the
section on implementation issues. Once the tree machine is
loaded, the tree controller will issue commands to the tree
machine to carry out the required operations. The results
output from the tree machine will be returned to the host
computer.


Figure 1: System configuration.


3. The Tree Machine

The tree machine has three kinds of nodes (see Figure 2):
○-nodes, □-nodes, and △-nodes. Each one of a collection of
records resides in a □-node, which is provided with some
logic to carry out a limited repertoire of instructions. The
○-nodes broadcast streams of instructions and/or data to the
□-nodes where they are executed in parallel. The □-nodes
compute results which are then combined by the △-nodes to
produce the final result. Selection of records satisfying a
conjunction of conditions can easily be performed by
broadcasting the conditions to the □-nodes which can then
decide which ones are to be selected. First we shall review the
insertion and deletion algorithms mentioned in [3], and
propose a new space allocation scheme. Then we shall
discuss how data flow in the ○-nodes and △-nodes should be
disciplined.


Figure 2: The tree machine (showing the input root node and the output root node).


3.1. A New Space Allocation Scheme

One way of doing insertion is to maintain a count in each of
the ○-nodes. Each count in a ○-node specifies the number of
free □-nodes which are its descendants. Each time a new
element is to be inserted, a ○-node will pass on the element to
the son which has free □-nodes below (choosing an arbitrary
one if both are eligible). Then it will update its own count by
decrementing it by one. Similarly, when an element contained
in a □-node is deleted, some of the ○-nodes need to have
their counts updated. More specifically, these are all the
○-nodes which lie on the path from the input root node to the
particular □-node where deletion has occurred. This can be
done by proceeding backwards from the deleted node to the
input root node, adjusting the counts on its way up. O(log N)
steps are therefore necessary to adjust the counts, where N is
the total number of □-nodes of the tree. While this scheme
has the advantage of being very general, it has the drawback
of requiring storage for the count, as well as the associated
logic needed for updates, in each of the ○-nodes.
Furthermore, since counts need to be adjusted after a deletion
is made, it makes pipelining of an arbitrary sequence of
insertions and deletions more difficult.


We wish to design new insertion and deletion algorithms
with the following two objectives:


1. An arbitrary sequence of insertions and deletions can
be easily pipelined.


2. No counts nor associated logic for updates are to
be maintained in the ○-nodes.


We have found a way to achieve the above if the following
assumptions are made:


1. A single count is kept in the tree controller.


2. For each delete command issued by the tree
controller, there exists always one and only one
item in a □-node which will be deleted.


Consider each □-node as containing storage for two fields,
node.freeposition and node.content. If a □-node is free, then
node.freeposition contains an integer from 0 to N-1, where N is
the total number of □-nodes in the tree. Also, for simplicity of
notation, we write $n_i$ for the □-node whose freeposition field
contains i, $0 \le i \le N-1$. If a □-node is occupied then its
freeposition contains Λ. Node.content is the value of the item
stored in the □-node which, for simplicity, will be assumed to
be an integer.


Figure 3: An empty tree.


If the tree is empty (i.e., it stores the empty collection), we
assume that the free □-nodes of the tree are $n_0, n_1, n_2, \ldots,
n_{N-1}$, in any order. (See Figure 3, where we have omitted the
bottom half of the tree machine.) We also assume that the tree
controller maintains an integer count called FirstFree, such
that the free □-nodes are $n_{FirstFree}, n_{FirstFree+1}, \ldots, n_{N-1}$.
FirstFree contains 0 if the tree is empty and contains N if the
tree is full.


3.1.1. Insertion 


To insert an element X, the tree controller will generate an 


insert instruction which has _ two. parts, namely, 
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instruction.freeposition and instruction.content. 
Instruction.freeposition will indicate which O-node is to be 
removed from the pool of free C-nodes. The tree controller 
ASSIQNS NfirstFree tO be that node.  Instruction.content 
contains the value to be inserted. This is shown as follows. 


instruction.freeposition : = FirstFree; 
instruction.content: = X; 
FirstFree := FirstFree + 1 


All that the O-nodes need to do is merely to broadcast the 
instruction to their two sons. Simultaneously, each of the 
□-nodes will check whether it has been selected as the node to 
receive the element being inserted. Exactly one such node 
will be found and this will mark itself as occupied after 
redefining its content field. This is shown as follows. (Figure 4 
shows the tree after 6 elements have been inserted into an 
initially empty one.) 


if node.freeposition = instruction.freeposition 
then node.content := instruction.content; 
     node.freeposition := Λ 


Figure 4: After 6 insertions (# = occupied). 


3.1.2. Deletion 

We consider deletion of an element from the tree based on 
the content of that element. Suppose that we wish to delete 
the element X from the tree. Note that, by the assumption 
made before, always one and only one □-node will be freed 
whenever a delete command is issued. This means that the 
tree controller will know beforehand that one of the originally 
occupied □-nodes will be able to return to the pool of free 
□-nodes, even though it does not know which one. Therefore, 
the delete instruction issued by the controller will contain not 
only the content X to guide the deletion, but also the value that 
should be stored into the freeposition field of the freed node. 


FirstFree := FirstFree - 1; 
instruction.freeposition := FirstFree; 
instruction.content := X 


Again the O-nodes need not do anything more 
complicated than merely broadcasting the instructions. Each 
of the □-nodes will attempt to match the content it has with 
the content in the instruction. Only one will find a match and 
that one will be immediately returned to the free pool by 
redefining its freeposition field. 


if node.content = instruction.content 
then node.freeposition := instruction.freeposition; 
     node.content := Λ 


Here we have used Λ to indicate the null content. Note that 
the functions performed by the □-node in the insert and delete 
commands are symmetrical. We obtain one from the other by 
merely interchanging the words content and freeposition. 
Figure 5 shows what remains after two deletions have been 
made to the example illustrated by Figure 4. 


Figure 5: After 2 deletions (# = occupied). 
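

The insert and delete instructions of Sections 3.1.1 and 3.1.2 can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the class names Leaf and TreeController are hypothetical, the broadcast through the O-nodes is replaced by a plain loop over the leaves, and None stands in for the null marker. 

```python
NULL = None  # stands in for the paper's null marker


class Leaf:
    """One leaf node with the two fields described in the text."""

    def __init__(self, freeposition):
        self.freeposition = freeposition  # an integer while free, NULL once occupied
        self.content = NULL


class TreeController:
    """Hypothetical controller that keeps only the FirstFree counter."""

    def __init__(self, n):
        self.leaves = [Leaf(i) for i in range(n)]
        self.first_free = 0  # 0 when the tree is empty, n when it is full

    def insert(self, x):
        if self.first_free == len(self.leaves):
            raise OverflowError("tree is full")
        instr_free, instr_content = self.first_free, x
        self.first_free += 1
        # Broadcast: every leaf sees the instruction, exactly one matches.
        for leaf in self.leaves:
            if leaf.freeposition is not NULL and leaf.freeposition == instr_free:
                leaf.content = instr_content
                leaf.freeposition = NULL

    def delete(self, x):
        # Restriction from the text: exactly one occupied leaf holds x.
        self.first_free -= 1
        instr_free, instr_content = self.first_free, x
        for leaf in self.leaves:
            if leaf.content is not NULL and leaf.content == instr_content:
                leaf.freeposition = instr_free
                leaf.content = NULL

    def contents(self):
        return sorted(l.content for l in self.leaves if l.content is not NULL)

    def free_positions(self):
        return sorted(l.freeposition for l in self.leaves if l.freeposition is not NULL)
```

In this model the free positions always form the range FirstFree, ..., N-1, which is the invariant that lets the controller choose both values of a delete instruction before knowing which leaf will be freed. 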


3.1.3. Comments on the Algorithms 

In the original insertion scheme mentioned at the beginning 
of this section, the element to be inserted is passed down the 
tree through a path of O-nodes until a free □-node is reached. 
The selection of this path is guided by the O-nodes, which use 
their own count information as well as that of their sons. In 
the new scheme, the element being inserted does not follow 
any particular path, but is merely broadcast to all the □-nodes. 
It takes advantage of the content addressability of the 
tree machine to do the selection of the free □-node. Also, in 
the original deletion scheme, log N counts in the O-nodes 
need to be adjusted. Since we do not know which counts are 
to be incremented until the deletion is done, pipelining was not 
so easy to achieve. Here we have only two values to be 
adjusted, namely, those of FirstFree and of the freeposition 
field of the deleted node. The interesting thing is that both 
values can be determined at the time the delete command is 
issued by the controller. The reader can easily see that 
pipelining an arbitrary sequence of deletes and inserts presents 
no problem at all. In some sense, we have factored out the 
counts and logic from the O-nodes to the tree controller, 
thereby reducing the space needed for its implementation. In 
VLSI designs, there is often a trade-off between space and 
time. In this case, however, the new space allocation scheme 
has allowed us to reduce space requirements and at the same 
time achieve a better performance. 


3.1.4. Comments on the Restrictions 

Some restrictions have been made in order to make the 
proposed scheme applicable. In cases where a delete 
command may find many elements or none qualified for 
deletion, some more processing is required before the delete 
command can be issued. First those elements qualified for 
deletion should be selected and deletion can proceed using 
fields which uniquely identify them. In some applications 
deletion of a record is performed after some processing has 
been done on the same record. Therefore its presence among 
the collection of records is certain. Furthermore, if deletion is 
based on a primary key, then the restrictions are met and the 
scheme applies. 


3.2. Disciplining the Data Flow 

In operations where only one output is involved, new 
commands can be issued to the machine while the results are 
being handled at the A-nodes to be output. In other words, 
pipelining is easily achieved. In some operations, however, 
many results are produced in the □-nodes. These will 
traverse through the A-nodes until they reach the output root 
node. Given the funneling nature of the output binary tree 
(i.e., the bottom part of the tree machine, as shown in Figure 
2), the A-nodes should cooperate among themselves in order 
to produce an orderly evacuation of the many results. We 
shall provide some storage in each A-node. If its storage is 
empty then the A-node will examine its two sons and take the 
contents of a non-empty son. If both sons have information to 
be transmitted, then it will select one according to some fixed 
rule (such as always picking the leftmost, or selecting the one 
with minimum value in some specified key). 


Some of the results may have to be retained in the □-nodes 
for quite a while before they are accepted by the A-nodes. In 
order to protect these results from being destroyed by the 
incoming stream of instructions or data, the broadcasting of 
information in the O-nodes should also be disciplined. This 
can be done by the tree controller, which can turn off the 
pipeline and wait until the machine is flushed before starting 
another operation. To perform one operation, however, such 
as the full join, many tree machine instructions may be 
required, and some of these may also produce multiple results 
in the □-nodes. While it is reasonable to turn off the pipeline 
between different operations (such as a full join and a subsequent 
union operation), it is too expensive to do so between machine 
instructions performed within one operation. We now face the 
problem of retaining, whenever possible, the pipelining of 
instructions, some of which may produce multiple results. 
The flow of the stream of instructions or data will not be 
continuous, and intermittent pauses may sometimes be 
necessary. This requires the full cooperation of all the 
O-nodes and the tree controller. Each O-node has storage to 
hold the information to be broadcast. It will send this 
information to its two sons only if both are empty, i.e., ready to 
accept data. Each O-node will examine its two sons. If both 
are ready, then it will transfer its contents to its two sons and 
declare itself ready. Similarly, the tree controller will only put 
new information to be broadcast in the input root node if it is 
ready. The □-nodes can control the flow of information from 
the layer of O-nodes immediately above them by declaring 
themselves ready only if their results, if any, have already been 
taken out by the A-nodes. 


3.2.1. Observation 1 

If any result formed in a □-node is always readily taken out 
without delay, then broadcasting items a_1, a_2, a_3, ... down the 
tree will result in alternate empty layers of O-nodes. With a 
tree machine of N □-nodes, (log N)/2 layers of O-nodes will be 
empty, as shown in Figure 6. Also, it takes log N steps for any 
item a_i (counted from the instant it enters the input root node) 
to reach a □-node. 
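

The alternating pattern of Observation 1 can be reproduced with a small simulation of a single root-to-leaf path. The sketch below rests on a timing assumption that the text does not spell out: all nodes act synchronously on the state seen at the start of a step, and the leaf below the last broadcasting node always accepts data at once. The function name broadcast_path is hypothetical. 

```python
def broadcast_path(depth, items, steps):
    """Simulate one root-to-leaf path of `depth` broadcasting nodes.

    Assumed timing: every node decides from the state at the start of
    the step, and the leaf under the last node is always ready.
    Returns one snapshot of the path per step.
    """
    path = [None] * depth
    pending = list(items)
    history = []
    for _ in range(steps):
        nxt = list(path)
        # The last broadcasting node hands its item to the always-ready leaf.
        if path[-1] is not None:
            nxt[-1] = None
        # A node forwards only if its son was empty at the start of the step.
        for i in range(depth - 2, -1, -1):
            if path[i] is not None and path[i + 1] is None:
                nxt[i + 1] = path[i]
                nxt[i] = None
        # The controller loads the root only if the root was empty.
        if path[0] is None and pending:
            nxt[0] = pending.pop(0)
        path = nxt
        history.append(list(path))
    return history
```

With depth = 4 (a path of log N = 4 broadcasting nodes), the snapshots settle into a pattern in which occupied and empty layers alternate, so that two of the four layers are empty at any time, and a new item enters the root every other step. 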


Figure 6: Alternate empty layers of O-nodes. 


3.2.2. Observation 2 

Consider a situation as above, in which alternate layers of 
O-nodes are empty. Suppose now the result computed in 
one of the □-nodes cannot be removed by the A-nodes for a 
long time. This □-node will therefore start to block the flow of 
information above it until all the O-nodes on the path from the 
input root node to it are filled (see Figure 7). Since alternate 
layers of O-nodes are originally empty, (log N)/2 more new 
elements can still enter the tree before the path in question 
becomes full. A new element enters the input root 
node every other step. Therefore it takes log N steps to fill up 
this path. 


Figure 7: Blocking of flow. 


3.2.3. Observation 3 

If at a certain instant all the □-nodes are empty (creation of 
an "empty layer"), then this "empty layer" will be propagated 
toward the top of the tree in log N time. (Also, if the creation of 
an "empty layer" of □-nodes occurs every other step in a total 
of log N steps, then (log N)/2 alternate empty layers of 
O-nodes will be created, as indicated in Observation 1.) 


3.2.4. Observation 4 

Since each O-node broadcasts to its two sons only if both 
are ready, any item a_i which enters the tree will reach all the 
□-nodes, though not necessarily at the same time. However, 
the items will visit a fixed □-node in the same order they 
entered the input tree node. 


4. Database Operations 

First we briefly discuss the search problems. The reader is 
referred to [3] for details. Next we consider the sort and 
remove-duplicates operations. The description of the join 
operation will constitute the major part of this section. Finally, 
the union and intersection operations will be briefly 
mentioned. Other database operations (such as division) will 
not be shown simply because they will not add anything new to 
the presentation. If we associate one bit to each field of a 
tuple and consider it to be valid only if the corresponding bit is 
on, then projection can be done by manipulating the 
appropriate bits. 


4.1. Search Problems 

By means of broadcasting, all the N □-nodes of the tree 
machine can receive a message sent out from the input root 
node in O(log N) time. A variety of search problems has been 
considered in [8]. They also show that, with pipelining, M 
operations can be performed in O(M + log N) time. 
Furthermore, several selection operations can be pipelined, 
such that the time will be linear in the total number of 
conditions. 


4.2. Sort and Remove-Duplicates 

Sorting a collection of records on some specified key is 
surprisingly easy in the tree machine. Recall that when trying 
to output multiple results stored in the □-nodes, the A-nodes 
are instructed to accept data from a non-empty son. A 
selection rule is used if both sons are non-empty. If we use the 
rule of selecting the minimum value in a specified key, then the 
output records will be sorted in ascending order. See an 
example in Figure 8, where a simplified representation of the 
bottom part of the tree machine is shown. The records reside 
in arbitrary positions of the □-nodes. The example shows the 
first steps to output the records in sorted order. 
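

The minimum-selection rule can be tried out on a small software model of the output tree. The following is a sketch under stated assumptions rather than the machine itself: the function name sort_via_output_tree is hypothetical, the tree is stored in an array with heap-style indexing, records are plain integers, empty leaves are None, and every node acts on the state seen at the start of a step. 

```python
def sort_via_output_tree(leaf_values):
    """Drain the leaves through one-slot output nodes, smallest key first."""
    n = len(leaf_values)  # number of leaves, assumed to be a power of two
    assert n >= 2 and n & (n - 1) == 0
    # Heap-style indexing: node 1 is the output root, nodes n..2n-1 are leaves.
    tree = [None] * n + list(leaf_values)
    remaining = sum(v is not None for v in leaf_values)
    output = []
    while len(output) < remaining:
        nxt = list(tree)
        # The controller removes whatever sits in the output root.
        if tree[1] is not None:
            output.append(tree[1])
            nxt[1] = None
        # Every empty output node takes from a non-empty son,
        # choosing the smaller key when both sons hold a record.
        for i in range(1, n):
            if tree[i] is None:
                sons = [s for s in (2 * i, 2 * i + 1) if tree[s] is not None]
                if sons:
                    pick = min(sons, key=lambda s: tree[s])
                    nxt[i] = tree[pick]
                    nxt[pick] = None
        tree = nxt
    return output
```

For example, sort_via_output_tree([5, 2, 8, 1]) returns [1, 2, 5, 8]: each output node always holds the smallest record still waiting in its subtree, so the root releases the records in ascending order. 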


In log N steps, the minimum of the collection of elements 
will emerge at the output root node. From then on, at every 
other step the next element in increasing order will exit the 
machine. (For simplicity, we have assumed enough datapath 
to transmit a tuple in a single cycle. This is similarly assumed 
throughout the paper. In a realistic situation, the time 
complexities should be multiplied by appropriate factors.) 


Figure 8: Sorting. 


Now it becomes clear how duplicate removal from a 
collection of K records can be performed. The collection is 
simply sorted. At the output the controller tests pairs of 
consecutive elements for equality. For every pair of equal 
elements it deletes one of the occurrences and this can be 
done while the sorting is still going on. Therefore in essentially 
2K steps we realize the duplicate removal operation. 
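

The test performed by the controller at the output can be sketched as a filter over the sorted stream. This is an illustrative fragment only; the function name remove_duplicates is hypothetical and the records are assumed to be directly comparable values. 

```python
def remove_duplicates(sorted_stream):
    """Yield each value once, comparing only consecutive elements."""
    previous = object()  # sentinel that equals no record
    for record in sorted_stream:
        if record != previous:
            yield record
        previous = record
```

Because the input arrives in sorted order, equal records are always adjacent, so one comparison per emerging record is enough and the filter can run while sorting is still in progress. 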


4.3. The Join Operation 

The join operation is performed on two relations over a 
specified attribute with common domain. The result of the join 
is another relation. A tuple of the result relation is composed 
by the concatenation of two tuples, one from each of the two 
relations, with identical values in the common attribute. 


4.3.1. Preliminary Assumptions 

We assume that a □-node has storage to hold a relation 
identification, a tuple and a tuple identification. While a tuple 
may require quite an amount of storage, the tuple id may need 
considerably less storage (log K bits if each of the K tuples of 
a relation is associated with a different number from 0 to K-1). 
The controller is provided with a parallel associative store, 
enough to hold log N entries. Each entry can hold a tuple and 
its corresponding tuple id. This associative store will allow the 
controller to retrieve a tuple content given its tuple id, 
assuming the tuple is present in the storage. 


4.3.2. Actions of Different Node Types 

Suppose we want to perform a join operation of two 
relations A and B, each with K tuples, over some given 
attribute. For convenience of exposition, tuples of relations A 
and B will be referred to as A-tuples and B-tuples, respectively. 
These tuples reside in arbitrary positions of the □-nodes. See 
Figure 9. To carry out the join operation, each of the three 
types of nodes will execute its instructions until the join 
operation is complete. 


A = A-tuple 
B = B-tuple 


Figure 9: Two relations residing in the tree. 


The O-nodes are instructed simply to broadcast whatever 
they receive to their sons, obeying the protocol established in 
the previous section. 


The □-nodes holding A-tuples are instructed to send a 
copy of the tuple and its corresponding tuple id to the 
A-nodes. Again, for ease of exposition, an A-tuple plus its 
tuple id will be referred to as A-information. Once this is done, 
its mission is considered accomplished. Any information 
received from above will simply be ignored until the join 
operation is complete. 


Each □-node holding a B-tuple is instructed to extract the 
attribute value over which the join is being performed and 
compare it with the incoming information, which includes an 
attribute value of an A-tuple, as well as its tuple id (see below). 
In case of a match, a copy of the B-tuple and the received 
tuple id should be sent out to the A-nodes. (We shall refer to 
these as B-information.) This action is to be repeated for every 
incoming piece of information. As we have seen earlier, these 
□-nodes can always hold the stream of information coming to 
them by declaring themselves not ready to accept new data. 


The A-nodes are instructed simply to pass on data toward 
the output node of the tree. In case an A-node finds both sons 
with information to be transmitted, it will always give priority to 
the B-information. 


4.3.3. Carrying Out a Join Operation 

Starting from the situation as in Figure 9, all the □-nodes 
holding A-tuples will try to send out their A-information to the 
A-nodes. After log N steps, one of these A-information will 
emerge at the output. And from then on, a new information 
will emerge at every other step. If the output contains 
A-information, the controller will store the tuple and its id in 
the associative store. Furthermore, this tuple id plus the 
attribute value over which the join is being done are redirected 
to the circular input node of the tree, to be broadcast 
downward. (We shall refer to these as α-information.) After 
another log N steps, the first of these information will reach all 
the □-nodes (see Figure 10). 


Figure 10 (legend: A-tuples, B-tuples, A-information, α-information) 


□-nodes holding A-tuples will ignore this information, 
without ever blocking its flow. □-nodes holding B-tuples, as 
they are instructed to, will try to match the two attribute values 
(one extracted from the tuple it contains and the other from 
the α-information it receives from above). In case of a match, 
it will make a copy of the tuple B and send it out together with 
the A-tuple's id (or B-information). This tuple id corresponds 
to some A-tuple which should be concatenated to the B-tuple 
to form a result tuple. 


These B-information (many such may be formed) will start 
descending the A-nodes, log N steps being necessary for the 
first of them to reach the output root node (see Figure 11). If 
the output contains B-information, the tree controller will 
locate the A-tuple in the associative store, given its id. Thus 
the result tuple can readily be assembled. 


Figure 11 (legend: A-tuples, B-tuples, A-information, B-information, α-information) 


We now show that overflow will never occur, as more and 
more A-information are added to the tree. The □-nodes 
holding A-tuples ignore any information broadcast to them and 
will never block the downward flow. The □-nodes holding 
B-tuples may block the flow if, after a match, the B-information 
it has produced for output is still waiting to be taken by the 
A-nodes below. Let c_1 be the clock cycle in which the first 
B-information is formed in one or more □-nodes. In 
subsequent clock cycles, more B-information may be 
produced at the □-nodes. Let c_2 > c_1 be the nearest clock 
cycle to c_1 in which none of the □-nodes is holding any 
B-information. Recall the B-information have priority over the 
A-information to traverse among the A-nodes. Once the first 
of these B-information formed between c_1 and c_2 emerges at 
the output root node, an A-information will have a chance to 
get out only after all such B-information have been output. 
However, the leader of these B-information will take log N 
steps to reach the exit. During the same time, (log N)/2 of the 
A-information will have emerged. These will not cause 
overflow because, prior to c_1, the flow in the O-nodes has 
been unrestricted and, by Observation 1, alternate layers of 
O-nodes are empty. Therefore these A-information will be 
appropriately accommodated without causing overflow. At c_2, 
none of the □-nodes are holding any result to be output. 
Therefore, by Observation 3, each time this happens, an 
"empty layer" of □-nodes will be created and it takes log N 
cycles for the "empty layer" to propagate to the top. The next 
A-information will have a chance to exit the tree after all the 
B-information formed between c_1 and c_2 have been output, 
provided it has not been caught up by some other 
B-information produced at still later cycles. The last 
B-information will take at least log N cycles to get to the exit, 
therefore an empty input root node will have been created to 
accommodate the next A-information. By a similar reasoning 
and using Observation 4, we can show that each B-information 
which emerges from the tree machine will find the needed 
A-tuple in the associative storage. 


The time necessary to perform the join can easily be 
computed if we fix our attention at the output node of the tree. 
From the instant the first A-information emerges at the output, 
some information, either A or B, will come out every other step. 
There are exactly K A-information to be output from the tree, 
and each B-information will correspond to a result tuple. 
Therefore it takes log N + 2(K + number of result tuples) steps to 
realize the full join of two relations. 
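

The roles of the three kinds of information can be summarized in a functional sketch. It models only what is exchanged, not the timing or the tree: the names tree_join and join_steps are hypothetical, tuples are represented as Python dictionaries, and the associative store is an ordinary dictionary that keeps every A-tuple, whereas the text provides only log N entries. 

```python
import math


def tree_join(a_tuples, b_tuples, key):
    """Functional sketch of the join; both inputs are lists of dicts."""
    associative_store = {}  # controller side: tuple id -> A-tuple
    results = []
    for tuple_id, a in enumerate(a_tuples):
        # A-information reaches the controller, which stores it ...
        associative_store[tuple_id] = a
        # ... and broadcasts (tuple id, attribute value) back down the tree.
        alpha = (tuple_id, a[key])
        for b in b_tuples:
            # Each leaf holding a B-tuple compares its own attribute value.
            if b[key] == alpha[1]:
                b_information = (b, alpha[0])
                # The controller finds the A-tuple by id and concatenates.
                matched_a = associative_store[b_information[1]]
                results.append({**matched_a, **b})
    return results


def join_steps(n_leaves, k, n_results):
    """Step count quoted in the text: log N + 2(K + number of result tuples)."""
    return int(math.log2(n_leaves)) + 2 * (k + n_results)
```

For instance, with N = 8 leaves, K = 2 tuples and 2 result tuples, join_steps(8, 2, 2) evaluates the formula as 3 + 2(2 + 2) = 11 steps. 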


4.4. Union and Intersection 

The union and intersection of two sets A and B of K 
elements each can similarly be obtained. Briefly we describe 
how the intersection can be performed. All tuples of one 
relation are sent out to be compared simultaneously by all 
tuples of the other relation. The matches constitute the result 
of intersection. Thus intersection can be obtained in 
2(K + |A ∩ B|) steps and union in 2|A ∪ B| steps. 


5. Implementation Issues and Major Problems 


5.1. Chip Layouts 

First we discuss how we can place the different types of 
nodes on chips. The two "mirrored" binary trees of Figure 12 
(a) can first be "unmirrored" to the one as shown in Figure 12 
(b), which is then laid out as in Figure 12 (c). This 
space-economical layout was first suggested by Mead 
and Rem [19]. In this layout, the amount of space is 
proportional to the number of nodes on a chip. Using the 
layout as in Figure 12 (c), we place the □-nodes on as few 
chips as allowed by the achievable circuit density, 
and then combine these chips together with chips containing 
only O-nodes and A-nodes. 


Figure 12: Chip layout. 


5.2. Loading the Tree Machine 

Loading the tree machine constitutes the bottleneck of the 
system and is the major problem to be solved. One solution is 
to provide the capability of reading multiple tracks to load 
subtrees in parallel, bypassing the input root node of the tree. 
If such a solution is used, then the proposed space allocation 
scheme will have to be modified accordingly. The solution 
also calls for a considerable amount of communication paths 
from the tree to the outside world. If many chips are needed to 
implement a tree machine, the required amount of pins for 
parallel loading will be readily available. If only a few chips are 
needed, then this solution cannot be used. In this case, we 
can perhaps construct several tree machines and overlap the 
I/O and computation proper. The number of such devices 
depends on the desired response time as well as the various 
timing characteristics. 


5.3. Number of Chips Required 

Let us estimate the number of chips to implement a tree 
machine with a capacity of holding a cylinder of data. This is 
motivated by the DBC design which assumes that a cylinder of 
data can be searched in a complete revolution [1]. We choose 
arbitrarily a tuple size of 64 bytes. With a cylinder capacity of 
500,000 bytes, the tree machine will have 500,000/64, or 
roughly 8,000 □-nodes. We have designed a prototype chip 
implementing a simpler version of the tree machine where only 
insertion, deletion and membership testing have been 
considered [23]. Using that experience, we estimate that, for 
the complete version, about 8 □-nodes can be put on one 
chip. Therefore 8,000 □-nodes will require 1000 chips, which 
is feasible with current technology. With the rapid increase in 
circuit density, this number will become approximately 60 
chips in four years and only 4 chips in about ten years. 
Provided that the problems which arise with the increased 
density (such as that of powering) can be solved, this 
approach seems very promising to implement a large capacity 
tree machine. 
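

The arithmetic behind the estimate of 1000 chips can be checked directly. The variable names below are illustrative; the figures are the ones quoted in the text, and the rounding of the leaf count to 8,000 is shown alongside the unrounded values. 

```python
import math

cylinder_bytes = 500_000   # cylinder capacity quoted in the text
tuple_bytes = 64           # tuple size chosen in the text
leaves_per_chip = 8        # estimate for the complete version

# Unrounded figures: one leaf node per 64-byte tuple.
leaves_exact = math.ceil(cylinder_bytes / tuple_bytes)   # 7813
chips_exact = math.ceil(leaves_exact / leaves_per_chip)  # 977

# The text rounds the leaf count up to 8,000 before dividing.
chips_rounded = 8_000 // leaves_per_chip                 # 1000
```

The unrounded count of 977 chips is within a few percent of the rounded figure, so the estimate is not sensitive to the rounding. 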


5.4. Problem Partitioning 

We have assumed throughout this paper that all the related 
data can be accommodated in the tree machine. This 
assumption is certainly not realistic. Problem partitioning for 
cases in which the problem size exceeds the device capacity 
should also be studied. 


6. Conclusion 

We have proposed a design of a high-performance 
tree-structured machine to handle the basic database 
operations. The tree structure is very desirable for its 
logarithmic path from the root to any leaf node. This makes 
broadcasting of instructions a very convenient and 
inexpensive operation. Also, being a structure of two 
"mirrored" complete binary trees, one for input and one for 
output, the tree machine is especially suitable for pipelining of 
instructions and data. In the tree machine, data reside in the 
tree and different operations can be performed without having 
to move data around. This is important for processing queries 
which require the execution of a sequence of database 
operations before the answer is obtained. If all these 
operations can be performed at one single site, less I/O will be 
required. Bentley and Kung [8] make an interesting 
observation about the "computational structure" of the tree 
machine: it has very small input and output channels, with 
massive computation going on in between. Although much 
search or other efforts are needed to process a query, the 
answers frequently consist of only a few records. The tree 
machine seems especially adequate for such operations. 


The particular design we have proposed here is an attempt 
to exploit the recent VLSI technology. One peculiar 
characteristic in this technology is that logic is cheap but 
communication costly. Also, by replicating one basic cell a 
large number of times on a chip, design costs are reduced. 
This is why regularity and locality are such important 
properties in VLSI design (see [9] for a detailed discussion). 
The tree machine possesses precisely these properties. There 
are only three kinds of basic cells (or nodes), each of which 
interacts only with a few neighbors in a very regular way. 
This approach seems especially attractive in the near future 
when circuit density continues to rise. 


ABSTRACT 


In the design of very large data base machines, 
multiple processors can be employed effectively to increase 
performance. When massive amounts of data must be moved, 
the topology of the processor interconnections is important. 
To determine an appropriate interconnection scheme, a simple 
model of a very common but difficult data base operation is 
used. This is the elimination of duplicate information in a 
collection of data elements. In particular, five methods are 
considered which can perform the elimination of duplicates. 
These methods and their corresponding interconnection 
topologies are analyzed and compared to help determine a 
suitable multiprocessor topology and computer architecture 
for a data base machine. A hybrid architecture is shown to be 
near-optimal. 

Key words: database machine; computer network; 
multiprocessor; computer architecture; relational 
data base; computer system analysis. 


1. INTRODUCTION 


Data base management systems perform well today 
largely through the use of sophisticated software struc- 
tures to enhance the retrieval of the desired information. 
The price paid in terms of software complexity and 
storage overhead is quite severe, and may become exces- 
sive for very large data bases. Unfortunately, there will 
inevitably be some queries of considerable importance 
which cannot be performed in an acceptable length of 
time because the proper structures do not exist in the 
data organization, even though the information is 
present. One approach is to build a special purpose 
machine for data base management systems (DBMS). 


A great deal of interest has been generated recently 
over the concept of a multiprocessor data base machine 
and the topology that it should take. Since one of the 
most significant ways that hardware can be used to 
attack a problem is through parallelism, it is not surprising 
that virtually all of the recent proposals [1]-[12] have 
utilized this idea to a significant degree. One might ask 
the following questions: 


(1) What tasks performed by a DBMS can be improved 
by the use of parallel processors? 

(2) How can multiple processors best be organized 
to optimize the execution of those tasks? 


It is instructive to consider the following problem: 
Given the telephone directory as a data base, determine 
the name of the occupant at a known address. Although 
the information is clearly present in the directory it can- 
not be readily retrieved. If this question were to be asked 
frequently, the problem could be solved by producing an 
index of addresses. It is not feasible to build such so- 
called secondary indices for all possible queries, since 
those queries can become arbitrarily complicated. The 
only general solution is to be prepared to search the 
entire data base. 


A number of generic operations on data bases can be 
identified which can provide a great deal of insight into 
the requirements for efficient data base support. In the 
relational model, for example, these operations might 
include restriction, projection, and equi-join, among oth- 
ers. In the implementation of these operations, some 
more fundamental operations occur repeatedly. One of 
the most expensive of these is the elimination of dupli- 
cates in a relation, performed every time a projection 
occurs. An equivalent operation is performed in any data 
base, however, and this procedure is certainly not unique 
to the relational model. The elimination of duplicates is 
so expensive that the user is often given the opportunity 
to tell the system when it need not be done. We have 
chosen to study this particular operation at some length. 


2. THE ELIMINATION OF DUPLICATES 


During an exhaustive search as well as during gen- 
eral set-oriented operations, a DBMS collapses the data 
required by eliminating certain fields. Further reduction 
of the data is then possible because the remaining fields 
contain many duplicate entries. In the handling of a com- 
plex query this reduction in data is crucial and must be 
performed several times. 


In terms of our telephone directory model we can 
consider the following operation: 


List all of the street names present in the directory. 


The stripping away of the names, street numbers, and 
telephone numbers will leave the desired information, but 
in a highly redundant form, since many people live on the 
same street. The result may be thought of as a list of 
numbers and the problem is to eliminate multiple 
occurrences of a number. 


We shall assume that the number of elements N making 
up the list is large, too large to be reasonably supported 
by a single processor. Duplicates can be eliminated by 
exhaustive comparison or by sorting, or by some combination 
of the two. The advantage of the sorting approach is this: 
direct comparison of all pairs of elements in a list of 
length N requires O(N^2) comparisons to eliminate all 
duplicates. Sorting algorithms, on the other hand, may 
require only¹ O(N log N) comparisons worst case and may, 
depending on the order in the list, require a substantially 
smaller number than that. Algorithms exist, in fact, for 
sorting in O(N) operations[13]. However, sorting implies the 
moving of a large amount of data, more or less randomly. 
This is awkward, particularly if the elements to be sorted 
are in different processors. Thus a tradeoff exists between 
the movement of data and the number of comparisons, 
depending on the approach chosen. 
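

The gap between the two approaches can be made concrete by counting comparisons on a single processor. The sketch below is illustrative only: the function names are hypothetical, and the sorting variant relies on Python's built-in sort rather than on any algorithm from this paper. 

```python
import functools


def dedup_pairwise(items):
    """Exhaustive comparison: test each element against every unique one kept so far."""
    comparisons = 0
    unique = []
    for x in items:
        duplicate = False
        for u in unique:
            comparisons += 1
            if x == u:
                duplicate = True
                break
        if not duplicate:
            unique.append(x)
    return unique, comparisons


def dedup_by_sorting(items):
    """Sort, then compare neighbours; the comparisons made by the sort are counted too."""
    comparisons = 0

    def compare(a, b):
        nonlocal comparisons
        comparisons += 1
        return (a > b) - (a < b)

    ordered = sorted(items, key=functools.cmp_to_key(compare))
    unique = []
    for x in ordered:
        if unique:
            comparisons += 1
            if unique[-1] == x:
                continue
        unique.append(x)
    return unique, comparisons
```

On a list of N distinct elements the pairwise variant performs N(N-1)/2 comparisons, while the sorting variant stays far below that for lists of a few hundred elements. Neither count says anything about data movement, which is the other half of the tradeoff. 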


We shall compare the methods to follow in two ways: 
the cost of computation C and the cost of communication 
M. The computation will be measured crudely by estimat- 
ing the number of comparisons required. The interpro- 
cessor movement of data is measured by the sum of the 
transmission of every element across every interproces- 
sor link. If an element must traverse three such links, 
then three message element links must be counted. 
Computation and communication costs will be deter- 
mined both for the total requirements and for the busiest 
processor and link, respectively. This allows for analysis 
based on either through-put requirements or response- 
time requirements, i.e. bandwidth or latency. 


3. PARALLEL METHODS 


Assume that a number of identical computers (P) are connected so
that they can communicate in a fairly intimate way among
themselves via messages. (P is assumed to be a power of 2, except
where noted.) Suppose that a list of numbers of total length N is
segmented into P lists of equal length L = N/P and distributed
over the P processors. In the general case, a wide range of
possible outcomes could result from the elimination of duplicate
elements, depending on the degree of redundancy in the data base.
In an attempt to establish bounds on the size of the task, we
shall consider the two extreme cases:


(1) All elements are identical. For this case let C1 be the total
    number of comparisons done in all processors and M1 be the
    total number of message element links. Also, define C1_max,
    the maximum number of comparisons performed in any one node,
    and M1_max, the maximum number of messages transmitted over
    any one link.


(2) All elements are unique, i.e., there are no duplicates. In
    this case, let CN be the total number of comparisons done in
    all processors and let MN be the total number of message
    element links. Also, define CN_max, the maximum number of
    comparisons performed in any one node, and MN_max, the maximum
    number of messages transmitted over any one link.


Identifying the duplicates in two ordered lists is equivalent in
complexity to merging the lists. For all methods presented, it is
assumed that each processor first sorts and eliminates its own
duplicates. Since this requirement is the same for all methods, it
has been ignored. Thus, to merge two ordered lists of lengths L1
and L2 requires L1 + L2 comparisons². How can the duplicates be
eliminated?

¹ Throughout this paper, unless otherwise specified, log means
log2.

² Knuth[14] shows that the minimum possible for the worst case is
actually L1 + L2 - 1 if L1 and L2 are approximately equal. We
shall ignore the constant term and assume that, in general,
merging y lists, each of length L, can be performed in yL log y
comparisons, recognizing that under this simplification merging
two lists of length one requires two compares.
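
The equivalence between identifying duplicates in two ordered lists and merging them can be sketched as follows. This is an illustrative Python fragment rather than code from the paper; `merge_unique` is a hypothetical name, and each three-way comparison of two list heads is counted as one comparison.

```python
def merge_unique(a, b):
    """Merge two sorted, duplicate-free lists, keeping one copy of shared elements.

    Returns the merged list and the number of head-to-head comparisons,
    which never exceeds len(a) + len(b) - 1.
    """
    merged, i, j, comparisons = [], 0, 0, 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] < b[j]:
            merged.append(a[i])
            i += 1
        elif b[j] < a[i]:
            merged.append(b[j])
            j += 1
        else:
            # Equal heads: the element is a duplicate, keep a single copy.
            merged.append(a[i])
            i += 1
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged, comparisons
```

For example, merging `[1, 3, 5]` with `[2, 3, 6]` yields `[1, 2, 3, 5, 6]` after four comparisons.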


Consider how it might be done by first eliminating 
duplicates within a list, then by comparing every pair of 
lists. Somehow the lists must be transmitted in a regular 
way so that all lists are compared against each other.
However, a method must avoid the problem of mutual 
destruction of all duplicates. The following method does 
this by assigning priorities to the processors: 


METHOD 1: BROADCAST / BUS ORGANIZATION 


The P processors are ordered and connected to a common bus. Each
processor eliminates the duplicates within its own list. The
first processor broadcasts its condensed list in sorted order to
the remaining processors and quits. Each processor which receives
the list compares it against its own elements and eliminates all
elements that match an element of the broadcast list. The
remaining processors sequentially broadcast their condensed lists
and quit. When all processors but the last are finished, the
duplicates have been eliminated.


This algorithm has the desirable property that the 
message size shrinks as the duplicates are eliminated. 
Thus the total length of the message units sent is the 
length of the list of all elements with duplicates elim- 
inated less the number of unique elements in the last 
processor. Obviously, this is the least possible communi- 
cation cost. 


It solves the problem of saving exactly one copy of 
each element by serializing the broadcasts. Note that 
these broadcasts cannot be done in parallel, even if mul- 
tiple busses are available. As a result, the parallelism is 
limited. On average, no more than half of the processors 
are busy. Since each list is sorted before communication 
begins, the removal of the duplicates can be done in one 
pass for each broadcast. If there are no duplicates actually
present, then the total number of comparisons CN is the sum of the
number of comparisons

    $$ CN = 2L(P-1) + 2L(P-2) + \cdots + 2L(1) = LP(P-1) = N(P-1). $$


Also

    $$ MN = (P-1)L = N - L, \qquad MN_{\max} = MN = N - L. $$

Here the broadcast of a message to many other processors is
counted as only one message sent. If there is only one unique
element, only that one element is sent, and each of the other
P - 1 processors compares it and eliminates its copy:

    $$ C1 = 2(P-1), \qquad C1_{\max} = 2, $$

and

    $$ M1 = 1, \qquad M1_{\max} = 1. $$


Of course, there will be some messages necessary to 
notify other processors that no more elements are to be 
sent, but this is considered overhead which, in general, is
small enough to ignore. 
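
The sequential broadcast of Method 1 can be simulated with a short Python sketch. This is an illustration under simplifying assumptions, not the paper's implementation: `broadcast_dedup` is a hypothetical name, each processor is modeled as a list, set membership replaces the one-pass merge, and only message elements (not comparisons) are counted.

```python
def broadcast_dedup(lists):
    """Simulate Method 1: processors broadcast in priority order on one bus.

    Returns the surviving local lists and the number of message elements
    placed on the bus (a broadcast counts once, however many listen).
    """
    local = [sorted(set(items)) for items in lists]  # local duplicate removal
    sent = 0
    for i in range(len(local) - 1):      # the last processor need not broadcast
        message = local[i]
        sent += len(message)
        seen = set(message)
        for j in range(i + 1, len(local)):   # only processors still waiting listen
            local[j] = [x for x in local[j] if x not in seen]
    return local, sent
```

With four processors holding three distinct elements each, `sent` is 9, matching MN = N - L; with all elements identical it is 1, matching M1 = 1.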
Another sort of priority can be introduced by allowing an
additional processor to do the comparison of the two or more
lists and produce the result:


METHOD 2: TREE ORGANIZATION 


Each processor, after eliminating its own dupli- 
cates, sends its list to its parent in sorted order, 
which merges the lists it receives, eliminating the 
duplicates, and sends the result on to its parent. 
This continues until the final list is formed at the 
root of the tree. 


In this structure there are actually (yP - 1)/(y - 1) processors
connected as a tree, where y is the branching factor of the tree
and P is a power of y. P processors are leaves, (P - 1)/(y - 1)
are non-leaves. The P processors at the leaves have direct access
to the data. This method requires more processors, nearly twice as
many as in the previous case. It is very effective if the number
of duplicates is large. However, if few duplicates exist, the
length of the list will increase by nearly a factor of y at each
stage, increasing both the computation and the worst case message
traffic with each level up the tree. Thus each succeeding step
uses only 1/y as many processors, each of which must do y times as
much computation. For this case we calculate CN as follows: There
are log_y P levels (counting the root or the leaves, but not
both). There are y(P - 1)/(y - 1) links, one above every node
except the root. Numbering the levels in ascending order starting
with the leaves as 0,

    level 1 contains P/y processors, each merging y lists of
    length L,

    level 2 contains P/y^2 processors, each merging y lists of
    length yL,

    level j contains P/y^j processors, each merging y lists of
    length y^(j-1) L.

Assuming that y lists of length L require yL log y comparisons³,
we get for CN

    $$ CN = \sum_{j=1}^{\log_y P} \frac{P}{y^j}\, y^j L \log y = \frac{P}{y}\, yL \log y + \frac{P}{y^2}\, y^2 L \log y + \cdots $$

or

    $$ CN = P L (\log_2 y)(\log_y P) = N \log P. $$


CN_max is computed for the top node, merging y lists of length
N/y. Again assuming that y lists of length L require yL log y
comparisons,

    $$ CN_{\max} = y\,\frac{N}{y} \log y = N \log y. $$

The calculation of MN follows from the observation that each
element goes from a leaf node to the root, i.e., through log_y P
links. Therefore,

    $$ MN = N \log_y P = \frac{N}{\log y} \log P. $$

Each of the top level links carries N/y elements, i.e.,

    $$ MN_{\max} = \frac{N}{y}. $$


This difficulty suggests that the upper nodes might require
greater power and larger memory. If all elements are identical, no
such congestion occurs. Each non-leaf node receives y lists of
length 1. If we assume that merging y lists of length 1 requires
y log y operations, then

    $$ C1 = \frac{P-1}{y-1}\, y \log y. $$

Since each non-leaf node does the same number of operations,

    $$ C1_{\max} = \frac{C1}{\text{number of non-leaf nodes}} = y \log y. $$

Since exactly one element passes through each link,

    $$ M1 = \frac{y(P-1)}{y-1} $$

and

    $$ M1_{\max} = 1. $$

³ Strictly speaking, this is an equality only if y is a power of
2. Comparison is inherently a binary operation, and a y-way merge
can be accomplished with only about log y comparisons per element
using a selection tree when y is a power of 2.
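
The traffic figures for the tree organization can be checked with a small simulation. The following Python sketch is an illustration only: `tree_dedup` is a hypothetical name, the tree is modeled level by level rather than as separate processors, and the standard-library `heapq.merge` stands in for the y-way merge at each non-leaf node.

```python
import heapq


def tree_dedup(leaf_lists, y=2):
    """Simulate Method 2 on a y-ary tree.

    Returns (list at the root, total message element links,
    largest number of elements carried by any single link).
    """
    level = [sorted(set(items)) for items in leaf_lists]
    total_links, busiest = 0, 0
    while len(level) > 1:
        parents = []
        for k in range(0, len(level), y):
            children = level[k:k + y]
            for child in children:           # each child list crosses one link
                total_links += len(child)
                busiest = max(busiest, len(child))
            merged = []
            for x in heapq.merge(*children):  # y-way merge of sorted lists
                if not merged or merged[-1] != x:
                    merged.append(x)
            parents.append(merged)
        level = parents
    return level[0], total_links, busiest
```

For P = 8 leaves holding two distinct elements each, the simulation reports 48 message element links (N log P) and 8 elements on the busiest link (N/y); with all elements identical it reports 14 and 1.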


Consider the binary tree case (y = 2). Since the lists 
were sorted before being sent to a parent all that is 
required of the parent is a merge of ordered lists, a pro- 
cedure that increases only linearly with the length of the 
list. Now suppose that instead of sending the list to a 
common parent, the two processors divide their elements 
into two lists in a commonly agreed way and exchange 
one of them. This leads to the first algorithm utilizing a 
global sorting: 


METHOD 3: BINARY MERGE / n-CUBE


P processors are numbered in binary from left to right, starting
with 0. Each processor eliminates its duplicates, leaving them in
sorted order. The range of values of the sort field is partitioned
in a universally agreed-upon way (the obvious way, for example, is
to use the most significant bit of each element), and each
processor breaks its list into two parts. It then sends one of the
two lists to the processor having the same address except for the
most significant bit as follows: If the most significant bit of
the address of the sending processor is a 1, it sends the first
list. Otherwise, it sends the second list.


After merging the received list with the retained one and
eliminating duplicates, each processor repeats the process, but
with the following modification:


Each partition of the range of the sort field is further
sub-divided into two parts. If the straightforward way is used,
then on step j, the jth most significant bit of the address is
used to determine which list to send and to whom it will be sent.


This process is repeated n = log P times, after which the range is
partitioned into P parts, and one processor contains all the
values for exactly one partition. If the obvious partition was
used, all numbers are sorted into the proper list according to
their n most significant bits.


The links required between the processors form the n-dimensional
structure known as the n-cube[15] and sometimes called hypercube.


This procedure requires only log P steps and does not get more
complex on subsequent steps; in fact it gets shorter with the
elimination of the redundant elements. After each of the log P
exchanges, all P processors merge two lists of length L/2.
Therefore,

    $$ CN = (\log P)\, P \cdot 2\,\frac{L}{2} \log 2 = N \log P. $$

Symmetry arguments guarantee that CN_max and MN_max are just CN/P
and MN/P respectively. Assuming that on each move, half the
elements are moved⁴,

    $$ MN = (\log P)\,\frac{N}{2} = \frac{N \log P}{2}. $$

For the unique case, after exchange j, P/2^j nodes have a list of
length 1 while the remainder have the empty list.

    $$ C1 = 2(1) \log 2 \sum_{j=1}^{\log P} \frac{P}{2^j} = 2(P-1), $$

    $$ C1_{\max} = 2(1) \log 2\,(\log P) = 2 \log P. $$

During exchange j, P/2^j elements are sent:

    $$ M1 = \sum_{j=1}^{\log P} \frac{P}{2^j} = P - 1 $$

and M1_max = 1.
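
The exchange pattern of Method 3 can be followed in a short simulation. The Python sketch below is an illustration under stated assumptions, not the paper's code: `ncube_dedup` is a hypothetical name, elements are non-negative integers of `value_bits` bits, the "obvious" partition by most significant bits is used, and sets stand in for the sorted merge with duplicate removal.

```python
def ncube_dedup(lists, value_bits):
    """Simulate Method 3 on P = 2**n processors (P assumed a power of 2).

    On step j the jth most significant bit of each element selects which
    of the two processors differing in the jth most significant address
    bit keeps it. Returns (final lists, message element links used).
    """
    P = len(lists)
    n = P.bit_length() - 1
    local = [set(items) for items in lists]
    moved = 0
    for step in range(1, n + 1):
        addr_mask = 1 << (n - step)        # address bit handled on this step
        shift = value_bits - step          # element bit handled on this step
        nxt = [set() for _ in range(P)]
        for p in range(P):
            for x in local[p]:
                if (x >> shift) & 1:
                    dest = p | addr_mask
                else:
                    dest = p & ~addr_mask
                if dest != p:
                    moved += 1             # the element crosses exactly one link
                nxt[dest].add(x)           # set insertion merges and deduplicates
        local = nxt
    return [sorted(s) for s in local], moved
```

After the last step, processor a holds exactly the elements whose n most significant bits equal a.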

There is nothing magic about the binary process, however. One
could use a y-way sort and divide the elements into y lists,
sending y - 1 off at each step. This would require fewer steps,
since

    $$ \log_y P < \log_2 P $$

for all values of y > 2, P > 1. Carrying this idea to its extreme,
we could work in base P, in which case only one swap would occur.
This results in the following method:


METHOD 4: P MERGE 


Each processor orders its own list and, after eliminating its own
duplicates, partitions the list into P separate lists in a
consistent way for all processors. Numbering these sublists from
lowest segment to highest, the jth segment is sent to the jth
processor. Each processor retains only that sublist which it would
send to itself, and merges it and the P - 1 incoming sublists as
they arrive.


This method again is near optimal in terms of the transmission of
information, at least for the case where there are few duplicates.
With no duplicates, each processor merges P lists, each of length
L/P:

    $$ CN = P \cdot P\,\frac{L}{P} \log P = N \log P, $$

    $$ CN_{\max} = \frac{N}{P} \log P. $$

Each processor sends L - L/P elements:⁵

    $$ MN = P\left(L - \frac{L}{P}\right) = N - L, $$

    $$ MN_{\max} = \frac{MN}{\text{number of links}} = \frac{N - L}{P(P-1)} = \frac{L}{P}. $$

For the single unique element case, only one processor receives
anything: P - 1 lists of length 1.

    $$ C1 = P(1) \log P = P \log P, \qquad C1_{\max} = C1 = P \log P, $$

    $$ M1 = P - 1, \qquad M1_{\max} = 1. $$
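
The single-swap behavior of Method 4 can be sketched in a few lines of Python. This is an illustrative model rather than the paper's implementation: `p_merge_dedup` is a hypothetical name, elements are assumed to be integers in the range 0 to `value_range` - 1, the range is cut into P equal segments, and sets stand in for the P-way merge.

```python
def p_merge_dedup(lists, value_range):
    """Simulate Method 4: every element goes directly to the processor
    that owns its segment of the value range.

    Returns (final lists, number of elements sent over a link).
    """
    P = len(lists)
    width = -(-value_range // P)           # ceiling division: segment size
    inbox = [set() for _ in range(P)]
    sent = 0
    for p, items in enumerate(lists):
        for x in set(items):               # local duplicates removed first
            dest = x // width
            if dest != p:
                sent += 1                  # retained sublists cost nothing
            inbox[dest].add(x)
    return [sorted(s) for s in inbox], sent
```

With P = 4 and four distinct elements per processor spread evenly over the range, `sent` is 12, matching MN = N - L; with one unique element held by every processor it is 3, matching M1 = P - 1.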


Another extension of Method 2 is to build a network which contains
the interconnections for one dimension of the n-cube and has the
capability to move the data among processors so that each exchange
can be accomplished with immediate neighbors. It has been shown
[16] that both the shuffling of the data and the exchange can be
effected in paths through only one link each for the network known
as the perfect shuffle. This leads to our last method.

⁴ In the worst case, when all elements are moved each time,
MN = N log P.

⁵ MN = N, worst case.


METHOD 5: BINARY MERGE / PERFECT SHUFFLE 


P processors are numbered in binary from left to 
right, starting with 0. Each processor has a link 
to one neighbor whose address is the same except 
for the least significant bit. This link is used to 
implement the exchange. In addition, each pro- 
cessor has two other links to the two processors 
having the same address but shifted (end-around) 
one position. Each of these links is used once for 
each shuffle. Each processor eliminates its dupli- 
cates, leaving them in sorted order. Using the 
agreed test, (again perhaps the most significant 
bit of each element), each processor partitions its 
list into two smaller ones. It then exchanges one 
list with its neighbor. 


After merging the received list with the retained one and
eliminating duplicates, a shuffle is performed, i.e., each
processor sends its entire list over the link to the processor
with the same address shifted one position, say, left.


This process is repeated log P - 1 times, after which the range is
partitioned into P parts and each processor has all the values for
exactly one partition.


The computation involved here is the same as for the n-cube
structure, so CN and CN_max are precisely the same as in that
case. Assuming again that on each move, half the elements are
moved, the communication involved in Method 3 is again required,
but additional communication is incurred because of the shuffles.
Each shuffle involves sending all surviving elements through one
link, and since there are log P - 1 shuffles,

    $$ MN = \frac{N \log P}{2} + N(\log P - 1) = \frac{3}{2} N \log P - N. $$

Since more traffic goes over the shuffle links, the traffic on the
busiest link is

    $$ MN_{\max} = \frac{N}{P}(\log P - 1) = L(\log P - 1). $$

For the unique case, again C1 and C1_max are the same as for
Method 3. Again an additional communication cost is incurred
because of the shuffle. During shuffle j, j = 1, 2, 3, ...,
(log P - 1), P/2^j nodes transmit a list of length 1, the
remainder transmitting the empty list. Thus the additional
communication cost for the shuffles is

    $$ \sum_{j=1}^{\log P - 1} \frac{P}{2^j} = P - 2, $$

so the total communication cost is

    $$ M1 = P - 1 + P - 2 = 2P - 3. $$

The busiest link is the exchange link of one particular processor
which carries the unique element on every exchange. Thus,

    $$ M1_{\max} = \log P. $$
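
The alternation of exchanges and shuffles in Method 5 can be traced with the following Python sketch. It is an illustration under assumptions not stated in the paper: `shuffle_exchange_dedup` is a hypothetical name, elements are integers of `value_bits` bits partitioned by their most significant bits, the shuffle is modeled as a left rotation of the processor address, and traffic over a self-loop (addresses that rotate onto themselves) is not counted.

```python
def shuffle_exchange_dedup(lists, value_bits):
    """Simulate Method 5 on P = 2**n processors (P assumed a power of 2).

    Each step exchanges across the least significant address bit; all
    steps but the last are followed by a shuffle (address rotated left).
    Returns (final lists, exchange traffic, shuffle traffic).
    """
    P = len(lists)
    n = P.bit_length() - 1
    local = [set(items) for items in lists]
    exchange_traffic = shuffle_traffic = 0
    for step in range(1, n + 1):
        shift = value_bits - step
        nxt = [set() for _ in range(P)]
        for p in range(P):
            for x in local[p]:
                dest = (p & ~1) | ((x >> shift) & 1)   # neighbor differs in last bit
                if dest != p:
                    exchange_traffic += 1
                nxt[dest].add(x)
        local = nxt
        if step < n:
            nxt = [set() for _ in range(P)]
            for p in range(P):
                dest = ((p << 1) | (p >> (n - 1))) & (P - 1)   # rotate address left
                if dest != p:
                    shuffle_traffic += len(local[p])
                nxt[dest] = local[p]
            local = nxt
    return [sorted(s) for s in local], exchange_traffic, shuffle_traffic
```

As with the n-cube, processor a ends up holding the elements whose n most significant bits equal a; the extra shuffle traffic is reported separately so that it can be compared with the exchange traffic.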


4. COMPARISON OF THE METHODS 


Table 1 compares the five methods under the assumption that no
duplicate data exists. Table 2 compares the five methods for the
model where all elements are identical. The parameters have been
normalized for the case of all unique elements by dividing by L,
the length of the list in each processor. For purposes of
comparison, the following assumptions have been made:


(1) Order is initially totally random, but the elements are evenly
    distributed among the processors.


(2) The numbers are scattered randomly, i.e. evenly, over their
    possible values.


The first assumption seems reasonable, though presum- 
ably it corresponds to some sort of worst case. The 
second assumption, however, requires some justification. 
Normally one would expect to find severe clustering of 
the numbers resulting from the fact that they are nor- 
mally derived from natural language or other organized 
sets of data. They can be randomized, however, by hash- 
ing the sort field. The sort order is changed, making the 
end result of little use as a sorted list. This is not terribly 
important in many cases, however, since the elimination 
of duplicates is so often an intermediate result and its 
ordering is not useful anyway. 


A more serious problem is that if the hashing func- 
tion fails to randomize the data sufficiently, some nodes 
may receive very large lists. This would imply that each 
processor must have enough memory to hold the entire 
list, violating our assumptions. This problem can be 
resolved, however, by aborting an operation as soon as an 
overflow occurs, and substituting a more appropriate 
hashing function. 
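
The randomization by hashing, together with the abort on overflow, can be sketched as follows. This Python fragment is an illustration only: `hashed_partition` and its `capacity` parameter are hypothetical, string keys are assumed, and SHA-256 from the standard library stands in for whatever hashing function a real system would choose.

```python
import hashlib


def hashed_partition(keys, P, capacity=None):
    """Assign each key to one of P nodes by hashing the sort field.

    Equal keys always land on the same node, so duplicates can be removed
    locally. If `capacity` is given, the operation aborts as soon as any
    node would hold more than that many distinct keys.
    """
    buckets = [set() for _ in range(P)]
    for key in keys:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        node = int.from_bytes(digest[:8], "big") % P
        buckets[node].add(key)
        if capacity is not None and len(buckets[node]) > capacity:
            raise OverflowError(f"node {node} exceeded capacity {capacity}")
    return buckets
```

A caller that catches the `OverflowError` could retry with a different hashing function, which is the recovery strategy described above.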


The comparison of message traffic among processors 
is not straightforward if the processors in the different 
cases have different kinds or numbers of ports. One 
might reasonably expect that processors with more ports 
or faster ports would be more expensive, so that it is also 
only fair to expect more performance from them. In the 
above cases the processors vary widely in their I/O capa- 
bility, from a binary tree or the perfect shuffle, which 
need only three ports per processor, to the complete 
interconnection, which requires as many ports on each 
processor as there are processors, less one. The bus 
structure is even harder to compare, since although only 
one port is specified, it nevertheless is obviously much 
different than the port required by the other cases. 


Despain[17] has shown for a single chip computer that power
considerations limit the total I/O bandwidth of the processors.
Thus if the total bandwidth available is B bits/second, we can
assume that each of K identical ports can transmit a maximum of
B/K bits per second. They have also shown for the case where Q
processors share a bus that a processor using the bus can achieve
a bandwidth of only⁶ B/(Q-1) bits/second. A further result is that
reduced bandwidth is equivalent to an increase in the average path
length for a message, i.e., for a given set of message
interchanges there exists an average path length A, such that

    $$ A \cdot B_{eff} = B, $$

where $B_{eff}$ is the effective bandwidth through a port and B is
the total bandwidth available to a processor.

⁶ In the special case where the processors broadcast sequentially.
In the general case where all processors are vying for the bus, it
is much worse, i.e. B/(Q-1)^2.


If we assume that we can obtain a structure with 
equivalent performance by reducing the number of ports 
and increasing the number of intermediate nodes 
traversed, we can define equivalence among the various 
processors by multiplying the message traffic by the 
average path length. Thus we define 


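    $$ \overline{MN} = A \cdot MN, \qquad \overline{MN}_{\max} = A \cdot MN_{\max}, $$

    $$ \overline{M1} = A \cdot M1, \qquad \overline{M1}_{\max} = A \cdot M1_{\max}. $$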


Tables 1 and 2 show values for the effective path length A and the
normalized message traffic. Under these assumptions, the tree
(method 2) and the perfect shuffle (method 5) have the least total
message traffic, regardless of the duplication factor, though the
binary merge with the n-cube (method 3) has less message traffic
if P is quite small. Also, the optimal value for y is 4, although
the differences are small for values of 2 to 8. On the other hand,
when the duplication is low, the tree exhibits congestion near the
root, and all methods but the bus are superior to the tree with
respect to the busiest link, increasingly so with larger values of
P. When the duplication is high, however, the binary tree is
exceedingly effective, with the busiest link not affected even
with increases in P. Clearly none of these structures is best over
our range of consideration.
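
For readers who wish to tabulate the no-duplicate costs themselves, the closed forms derived in Section 3 can be evaluated directly. The Python sketch below is an aid to the reader, not part of the paper: `no_duplicate_costs` is a hypothetical name, the expressions are transcribed from the text as read here, and the values are the unnormalized ones (before multiplication by the path length A).

```python
import math


def no_duplicate_costs(P, L, y=2):
    """Evaluate the Section 3 closed forms for the case of no duplicates.

    P is the number of processors holding data, L the list length per
    processor, and y the branching factor used for the tree (method 2).
    """
    N = P * L
    lg = math.log2(P)
    return {
        "1 broadcast": {"CN": N * (P - 1), "MN": N - L, "MN_max": N - L},
        "2 tree": {"CN": N * lg, "MN": N * lg / math.log2(y), "MN_max": N / y},
        "3 n-cube": {"CN": N * lg, "MN": N * lg / 2, "MN_max": N * lg / (2 * P)},
        "4 P-merge": {"CN": N * lg, "MN": N - L, "MN_max": L / P},
        "5 shuffle": {"CN": N * lg, "MN": 1.5 * N * lg - N, "MN_max": L * (lg - 1)},
    }
```

Calling `no_duplicate_costs(8, 4)` shows the pattern discussed above: the broadcast method needs the most comparisons, while the tree concentrates the heaviest traffic on the links nearest the root.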


Some of the methods are asynchronous. The P - 1
messages that each processor sends in method 4 need 
not be sent simultaneously. Each processor can begin 
processing the second phase of the P-merge as soon as 
one message has arrived. Thus communication and pro- 
cessing can be overlapped. 


The binary merge (method 3) likewise can proceed 
asynchronously, with each node having a list of other pro- 
cessors with which it must communicate sequentially. 
Thus, either of these methods can be implemented on a
general computer network where all nodes can communi- 
cate with all others. Both require many messages to 
many different nodes, so it should be noted that efficient 
communications are vital in the elimination of duplicates 
using a sorting scheme. 


It is interesting to observe in method 2 that if the
initial elimination of duplicates results in the elements
being sorted, then the processors above the bottom level
can proceed asynchronously in a pipeline fashion. Each
processor may begin processing as soon as it has 
received one element from each of its children. After 
selecting the lowest value of those received, it can 
immediately send this element on to its parent, and 
remove it from its own list. Thus it is not at any time 
required to store the complete lists, which may be grow- 
ing quite large. Of course the node at the top must do 
something with the resulting list, and it might turn it 
around and send it down the tree, where it can be sorted 
on the way down, thus preserving a useful sort order. 
Thus all the non-leaf nodes can be working simultane- 
ously, resulting in a higher degree of parallelism than 
might otherwise be expected. 
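
The pipelined behavior of the non-leaf nodes can be illustrated with Python generators. This is a sketch under assumptions, not the paper's design: `tree_node` is a hypothetical name, each child is modeled as an iterator of sorted values, and the standard-library `heapq.merge` provides the lazy merge that holds only one pending element per child.

```python
import heapq


def tree_node(*children):
    """Stream the merged, duplicate-free output of sorted child streams.

    An element is passed upward as soon as it is known to be the smallest
    remaining one, so no node ever stores a complete list.
    """
    last = object()                      # sentinel that equals no element
    for x in heapq.merge(*children):     # lazy: one pending element per child
        if x != last:
            yield x
            last = x


# A two-level binary tree over four leaf lists, consumed at the root.
root = tree_node(
    tree_node(iter([1, 3]), iter([1, 2])),
    tree_node(iter([2, 5]), iter([0, 5])),
)
result = list(root)
```

Here `result` is `[0, 1, 2, 3, 5]`, and every intermediate node forwards elements one at a time rather than waiting for its children to finish.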


The model of method 2 uses up to twice as many pro- 
cessors as the other models. The difference in perfor- 
mance, however, is much greater than a factor of two. 
The amount of computation is no more than for any other 
method, so the processors on the average do only half as 
much work. The total message traffic, on the other hand, 
is much less than for any other method, for large values 
of P. The important consideration here is how rapidly 
the requirements grow as the number of processors 
grows, and in this respect, a mere factor of two is quite 
unimportant. 
Table 1. Comparison of five methods for eliminating duplicates
assuming that no duplicates exist. Method 1: Sequential broadcast.
Method 2: y-branch tree. Method 3: Binary merge. Method 4:
P-merge. Method 5: Perfect shuffle. CN is the total number of
comparisons done in all processors. MN is the total number of
message element links. CN_max is the maximum number of comparisons
done in one processor. MN_max is the maximum number of message
elements passing through any one node. $\overline{MN} = MN \cdot A$
is the total message traffic adjusted to compare processors with
different numbers of I/O ports.
$\overline{MN}_{\max} = MN_{\max} \cdot A$ is the normalized
measure of busiest link traffic. A is the normalization factor for
the variable number of ports required.


Table 2. Comparison of five methods for eliminating duplicates
assuming that all elements are identical. Method 1: Sequential
broadcast. Method 2: Tree. Method 3: Binary merge. Method 4: P
merge. Method 5: Perfect shuffle. C1 is the total number of
comparisons done in all processors. M1 is the total number of
message element links. A is the normalization factor for the
variable number of ports required. $\overline{M1} = M1 \cdot A$ is
the total message traffic adjusted to compare processors with
different numbers of I/O ports.
$\overline{M1}_{\max} = M1_{\max} \cdot A$ is the normalized
measure of worst case link traffic.
Methods 3 and 4 require substantially more data paths than any of
the other methods. Clearly the P-merge is not feasible for large P
if P(P-1) dedicated links are required. Even the (P log P)/2 links
required by the binary merge are hard to justify for large values
of P. This would mean log P links per node if dedicated links were
used.


A more serious problem is the lack of expandability 
imposed by these structures. A processor may have only 
a fixed number of ports, particularly if it is a single 
integrated circuit. Methods 3 and 4 require an increase 
in the number of ports as the number of processors 
grows, so that if room is left for expansion, then some of
the available ports are unused, wasting available resources.


The perfect shuffle seems to have many of the pro- 
perties needed here, though it is markedly inferior to the 
binary tree in the case of high duplication. It also has a 
large enough linear coefficient for MN that its superiority 
occurs only for large values of P. But it poses some 
unfortunate problems as well. It certainly cannot be 
gracefully expanded, since it requires a power of two
processors. Furthermore, the routing of messages in such a
structure is difficult because of its lack of symmetry.


On the other hand, the sequential broadcast method 
takes substantially longer to execute than the other 
methods. Also, if the duplication factor is low, it requires 
far more comparisons than any other method and much 
more communication bandwidth than any method except 
the complete interconnection. 


It is clear that the bus is inferior to the other 
methods. However, the others all have shortcomings 
which are extremely serious. The question then arises — 
is it possible to construct a network on which several of 
the methods can be implemented so that the best 
method may be employed in a given situation? 


Assuming that each processor in a structure has the same number of
ports, a significant variation among these structures is the
portion of ports actually used. The complete interconnection and
the n-cube algorithms, for example, use every port. But the binary
tree uses less than two-thirds of the ports it has, since each
leaf node has two unused ports. It has been suggested [18] that
this is desirable to allow a convenient placement of I/O devices,
a point that all structures must address somehow. Thus a fairer
comparison might require that each structure have as many unused
ports as it has processors. For methods 3, 4, and 5 this would be
approximately equivalent to increasing the value of A by 1.


An alternative approach, and one taken here, is to 
connect the unused ports of such a structure in some 
regular way. One possibility for the binary tree is to con- 
nect the leaves to form the perfect shuffle interconnec- 
tion (Fig. 1). The exchange can now be accomplished by 
messages exchanged through the common parent. 
Unfortunately, this doubles the traffic during the shuffle, 
which is already the dominant traffic for method 5. A 
better possibility exists. 


5. X-TREE


A topology recently proposed in connection with X- 
TREE[18],[19] can implement any of the algorithms. The 
structure, called hypertree, is the binary tree topology, 
but with each node having one extra link connecting it in 
a regular way to another node at the same level (Fig. 2). 
The structure is particularly well-suited for communica- 
tions among leaf nodes which are nearest neighbors in 
the n-cube. Since the structure is a binary tree, obviously
method 2 (binary tree) can be implemented
directly on the structure. In addition, method 3 (n-cube) 
can be implemented by using the leaf processors, passing 
messages through intermediate nodes where necessary. 
Furthermore, the structure has been shown to be well- 
suited for communication among all leaf nodes, so that 
method 4 could also be implemented conveniently. 
Method 5, the perfect shuffle algorithm, could also be 
implemented, though nothing is gained by the shuffle, so 
it is essentially the same as method 3. However, the 
extra ports of the leaf nodes can be connected in a per- 
fect shuffle so that the horizontal links can be used for 
the exchange (Fig. 3). With this addition, the structure 
can perform the binary merge as well as the perfect 
shuffle network except that the value of A is 4 instead of 
3. 


Tables 3 and 4 show the values for this model assum- 
ing the structure is used to implement methods 2, 3, 4, 
and 5. The computations, of course, do not change, being 
determined by the method and the corresponding logical 
structure. Degradation occurs for the binary tree structure
because the multiplication factor A is increased by
one to accommodate the additional link required for the 
hypertree connection. 


Method 3 is implemented using the extra links. It 
has been shown[19] that for communication between any 
pair of leaf nodes, an optimal path exists which goes no 
more than half way up the tree. This guarantees that the 
bottleneck which would occur in the simple binary tree if 
few duplicates are present, will not occur, or at least will 
be much less serious, since a factor of √P more links are
available to handle the traffic over the most heavily used 
path. 


The best method to use varies greatly, depending on 
the amount of duplication in the list, the number of pro- 
cessors, and the relative importance of total traffic 
versus busiest link traffic. 


For the high duplication case, the simple binary tree 
algorithm is always the best for the worst case link 
traffic, though it is slightly inferior to the binary merge 
(n-cube) algorithm in total traffic. Note also that these 
two methods have the lowest computational requirements 
as well, under these conditions. 


For the case of low duplication, the results are not so 
clear-cut. Up to about 128 leaf nodes, methods 3 and 4 
are best, with method 4 slightly preferable if total traffic 
is the consideration, and method 3 creating somewhat 
less total message traffic. Method 4 is generally superior 
at 128 nodes. 


Above 128 nodes, method 5, the perfect shuffle, 
becomes the best method because of its attractive bal- 
anced link traffic. Method 2, the binary tree algorithm, 
has slightly less total traffic, but must be rejected 
because of the excessive bottleneck occurring at the 
root, both in link traffic and in computation. 


6. CONCLUSIONS 


The proposed structure is able to implement the 
best algorithm for a given situation. Under our assump- 
tions, performance is nearly equal, and in some cases 
superior, to the structure for which the algorithm was 
originally proposed. The total message traffic, MNp/Z is 
actually improved, approximately by a factor of P/log P 
for the algorithm using the complete interconnection 
(method 4), though the worst case link traffic for the 
same model is increased by a factor of VP for the case of 
no duplicates. Since it is the method of choice only for 
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Figure &. Interconnection of Hypertree I. 
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Perfect shuffle interconnection superimposed on the leaves of the binary tree. 
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Figure 3. Perfect shuffle interconnection superimposed on the leaves of hypertree. Bottom 
level hypertree links define exchange pairs, so the leaf nodes are numbered, from left to 
right: 0, 2, 1, 3, 4, 6, 5, 7. 
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Table 3. Implementation of four methods for eliminating duplicates assuming that no duplicates 
exist and using the “hypertree” structure with the perfect shuffle as shown in Fig. 3. There are 2P - 1 
processors and 4P -3 links. The normalization factor, A, is 4. C'N is the total number of comparis- 
ons done in all processors. MN is the total number of message element links. CNyay is the max- 
imum number of comparisons done in one _ processor. MN,,,, is the maximum number of message 
elements passing through any one node. MN = MNA_is the total message traffic adjusted to compare 
processors with different numbers of 1/0 ports. MN wey = MNimaxA is the normalized worst case link 
traffic. 


Method 
2 3 4 5 

Ci 2(P - 1) 2(P - 1) P log P 2(P - 1) 
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Table 4. Implementation of four methods for eliminating duplicates assuming that all elements are 
identical and using the “hypertree” structure with perfect shuffle as shown in Fig. 3. There are 2P - 1 
processors and 4P - 3 links. The normalization factor, A, is 4. C1 is the total number of comparis- 
ons done in all processors. 1 is the total number of message element links. M1 = M14A is the total 
message traffic adjusted to compare processors with different mumbers of I/0 ports. 
M 1max = M1max4 is the normalized worst case link traffic. 
† This formula is correct only where P is not a power of 4. If P is a power of 4 the formula is slightly different.
values of P < 128 this would seem to be unimportant.


Method 3 shows some degradation in performance. 


The worst case link traffic is increased, for the case of no
duplicates, by a factor of √2P/log P, but for large
values of P the perfect shuffle algorithm predominates
anyway.

Methods 2 and 5 show only slight, linear degradation 
due to the increase in the value of A resulting from the 
unused links for that algorithm. Thus the proposed 
structure is able to achieve the same order of perfor- 
mance as the best of the methods considered for virtually 
all circumstances under our assumptions. If other 
assumptions are made, a different conclusion could also 
be drawn, as is evident from the results presented in 
Table 1.


The power of the tree structure is clear for the prob- 
lem of eliminating duplicates. However, the importance 
of flexibility in choosing the method is apparent. The 
best structure is clearly one which can handle both 
extreme cases (and thus presumably the cases in 
between) reasonably well. 


Before an architecture is chosen for a multiproces- 
sor data base machine, other typical data base opera- 
tions must be analyzed in a similar manner. 


base operation which provides significant insight into the 
requirements for a data base computer. Other con- 
siderations as well undoubtedly will influence the choice. 
For example, unlike some other models, the tree structure
is expandable without a modification to the processor
itself, which surely makes it more attractive if it is a
single component. Another issue is the fact that adjacent
processors may wish to communicate heavily at times. 
The binary tree, with nodes having fewer ports, each with 
more bandwidth, is clearly advantageous in this case. 


The X-Tree "hypertree" interconnection with the per- 
fect shuffle interconnection among the leaves has been 
shown to provide an attractive compromise of the models 
considered. It gives essentially the same performance as 
the best of the other structures over the range of condi- 
tions considered. 
Abstract -- The purpose of this paper is to
discuss specifications concerning peripheral 
transformation processor (PTP) systems in a data- 
base environment. 

A PTP can be seen as a link between a buffer 
system and secondary storage media for parallel 
transmission and intermediate manipulation of 
(blocks of) data. 

A PTP mainly consists of a highly modular data 
manipulation unit and a flexible control. 
Built-in fault-tolerant capabilities of a PTP 
system lead only to slight performance degra- 
dation if faulty components are detected. Typical 
PTP applications include update, simple associa- 
tive, and cryptographic operations. 


3880-like storage control systems provide 
the capabilities to operate and control several 
independent data paths between processor(s) and 
disk storage media. They do not allow, however, 
intermediate buffering and manipulation of data 
required for reducing channel and main storage 
activities in connection with database procedures. 


The peripheral transformation processor
(PTP) to be discussed is an attempt to incorporate
certain functions of dedicated units into a
highly parallel peripheral processor system. In
other words, a PTP is a special purpose function
architecture mainly for performance enhancement
of the corresponding overall system.

A PTP is by no means another database backend
machine [1,3,5,6,11,12,13]; rather, it could be seen as
an evolutionary step towards an intelligent,
database-oriented processing system.

A PTP has not been built yet. 


Special-purpose data manipulation units
described in literature include manipulators for
bit-slice functions [7], alignment (scramble/
unscramble) networks for multidimensional memory
access [9], encryption networks for improved data
security [8], associative or quasi-associative
search modules [2,10,11,12], and transformation/
translation units for database structure operations
[1].


A PTP architecture (see Figure 1) must 
satisfy the following requirements which are vital 
to I/O-intensive and robust database management. 


1. The PTP interfaces to a disk storage system
and to a block buffer system, allowing access
to different disks and buffers and to cylinder
slices of one disk at a time.

Thus a PTP operates parallel read-out,
parallel write-in, and mixed read-out/write-in
procedures.

(Clearly, the attachment of so-called electronic
disk devices is desirable for future PTP
business).


2. The PTP data manipulation unit (DMU) contains
a network of substitution boxes which 
satisfies the NBS data encryption standard 
and allows multiple data streams as plaintext 
input and encrypted/non-encrypted output. 

A substitution box realizes a permutation on 
{0,1} as well as the identity (bypass 
feature). 

A key register is included for modification 
of the network [8]. 


3. A more powerful substitution network provides
feedback capabilities for multi-phase data
manipulation and permits Boolean operations
between network stages (DMU stage logic).


4. The PTP masking facility is a DMU component
which allows vertical and horizontal masking 
of (blocks of) data. 


5. The DMU compare logic is restricted to
operation on a one-bit-per-word basis (fixed
bit location). Thus a PTP does not support
sophisticated time-consuming search procedures.


6. Because of the predominant modular structure
and the capability to divert multiple data
streams, a PTP is well-suited for fault-tolerant
data processing. Built-in fault-masking
and self-repair functions result in a
high PTP reliability.


7. Communication between PTP and host processes
is governed by a higher-level protocol 
(probably ISO-level 4) in contrast to proto- 
cols which are concerned with standard 
peripheral/main processor interactions. 

In a host-backend configuration a PTP does 
not contribute directly to communication 
between program execution system and database 
computer system. 


Note that requirements 1 to 7 necessitate a 
flexible PTP microstructure. 


The operations listed below demonstrate PTP
capabilities concerning support of database
management functions:


- copying, i.e. dynamic peripheral duplication
  of data

- combination of data from different resources
  (e.g. coincidence or merge of index bit lists)

- composition of (blocks of) data from different
  resources (e.g. combination of database keys
  and data items or of primitive search results;
  simple union operations)

- insertion/deletion of words within blocks of
  data (e.g. for index list updates)

- differentiation of sets of data (e.g. before/
  after images, basic/index data)

- projection, i.e. selection of certain data
  sub-blocks (domains)


[11] C.S. Lin, D.S. Smith, and J.M. Smith, "The
Design of a Rotating Associative Memory for
Relational Database Applications", ACM
Transactions on Database Applications
(March, 1976), pp. 53-65

[12] E.A. Ozkarahan, S.A. Schuster, and K.C.
Smith, "RAP - An Associative Processor for
Data Base Management", Proceedings AFIPS
National Computer Conference (1975),
pp. 379-387
Abstract -- A multiple module memory system is
termed stochastically conflict-free if its performance
is (statistically) guaranteed - regardless of
referencing behavior. A design for such systems
has been proposed. In the present paper we present
a formal analysis of its performance.


1. Introduction 


A parameterized design for a family of multiple-
module memory systems will be termed "Stochastically
Conflict-Free" if for any desired effective bandwidth
(i.e., post-conflict bandwidth), $\beta$, and for
any $\theta < 1$ and any $\epsilon > 0$, the design makes possible
the implementation of a system:


1. whose actual post-conflict bandwidth, $\beta'$,
is a random variable - not as a result of any statistical
assumptions which might be made regarding
the (memory module destinations of the) access requests
entering the system, but rather as a result
of an element of randomization deliberately introduced
as a part of the design,

2. for which the probability that $\beta' \ge \theta\beta$ is
itself within $\epsilon$ of 1 regardless of the referencing
behaviors of the devices which input access requests
- i.e., for every pattern of access requests
which might be input into the system.


Such a design is proposed in [1] where the Stochas- 
tic Conflict-Freedom of its performance is argued 
informally. 


The present paper presents a formal analysis of
the performance of systems implemented according to
the design and operating in what we will term
data-base mode. That is, it begins the development of
a formal methodology for determining the values which
design parameters should take in order for a system
to meet specified performance criteria.


The type of system which we have in mind
(pre-Stochastically Conflict-Free) is that of Figure 1.
It consists of some number, M, of memory modules
accessed via an interconnection structure which
might, depending upon M (the number of memory modules)
and N (the number of ports), be as simple as a
single shared bus or as complex as a routing network
[2]. Requests for access to words (records)
stored in the memory modules are entered into the 
system by request-issuing devices, (processors, 
query stations, etc.) each connected to one of the 
N ports; access requests traverse the interconnec- 
tion structure to arrive at queues in front of the 
appropriate memory modules, and each is serviced 
once it reaches the head of its queue; finally, the 
appropriate response to an access request is re- 
turned to the requesting device via the intercon- 
nection structure. 


In Sections 2 and 3 we review a number of re- 
quired preliminary definitions. In Sections 4, 5, 
and 6 we motivate the proposed design, indicate its
range of applicability, and review those of its
details required for an understanding of the performance
analysis presented in Section 7. Finally,
in Section 8 we present, very briefly because of
space limitations, a short design example.


2. Modes of Operation


We will distinguish two different modes of 
operation of multiple-module memory systems, multi- 
processor mode and data-base mode, each mode 
carrying with it its own design questions. 


2.1 Multiprocessor Mode 

The use of a multiple-module memory system as 
a part of a multiprocessor (or MIMD parallel proc- 
ess) entails that the devices attached to the N 
ports of Figure 1 are processors and that the M 
modules of memory constitute that part of the 
primary memory which is shared by all N processors. 
In this case, the memory modules of Figure 1 would 
probably (today) be modules of semiconductor 
memory. 


The multiprocessor mode of operation is defined
as follows: Initially each processor issues
a request for access to a word of memory. Thereafter,
a processor will not issue an additional
request before it has received a response to its
previous request.


The total rate at which requests enter the 
system is not fixed in advance, but is, rather, 
self-regulating. On each cycle the number of new 
requests entering the system is bounded from above 
by the difference between N and the number of 
processors which have not yet received responses 
to their most recent requests. 


2.2 Data-Base Mode 


The use of a multiple-module memory system as 
part of a data-base system entails that the de- 
vices attached to the N input ports of Figure 1 
are query stations and that access requests are 
for records rather than for individual words. In
this case, the memory modules of Figure 1 would
probably (today) be disks.


In the case of data-base systems one often
ignores the rates at which individual devices connected
to the ports of Figure 1 issue access requests,
concentrating rather on the ensemble rate
at which requests enter the system; this is reasonable
because in the case of data-base systems the
ensemble input rate is empirically observed to
frequently assume an almost constant value. This
value, probably as a result of users logging onto
and off the system in response to the quality of
service received, is often sustained for reasonably
long periods of time.


The data-base mode of operation will thus be
defined as the case of constant input rate. It is
this mode of operation with which we will be concerned
here.


3. Memory Contention 


It is, of course, precisely the fact that the 
access request traffic appearing at the memory 
module queues of Figure 1 will not be uniformly 
spread across the queues that tells us that N mem- 
ory modules will not suffice to drive N times as 
many processors as will a single memory module or 
to support a constant input rate of N times the 
response rate of a single memory module. 


This is precisely what is meant by memory
contention (conflict, interference); the degree to
which it affects the performance of a system is, of
course, dependent upon the degree of nonuniformity
of the memory referencing behaviors of the devices
connected to the N ports.


4. The Standard Statistical Assumptions


The following three statistical assumptions, 
which we will refer to hereafter as the "standard 
statistical assumptions," have been used (see [3] 
for example) as a basis for the study of the effec- 
tive bandwidth to be expected from systems of the 
type depicted in Figure 1: 


1. Each individual request for access to an 
item stored in memory is to a memory module chosen 
at random from a uniform distribution over all the 
modules.

2. The (memory module) destinations of access 
requests input by different devices (attached to
ports of Figure 1) are statistically independent of 
one another. 

3. The (memory module) destinations of suc- 
cessive access requests input by the same device 
are statistically independent of one another. 


If these assumptions indeed hold for some par- 
ticular application, one would expect the effective 
bandwidth of an M-module memory system put to use
in that application not to fall very far short of 
M times the bandwidth of a single module; the rea- 
soning is, speaking very crudely, that the Law of 
Large Numbers would assure, in the long run, rea- 
sonable uniformity of spread of access request 
traffic over modules of memory. 
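
The Law of Large Numbers argument above can be illustrated numerically. The following is only a minimal sketch under the three standard statistical assumptions; the function name `max_to_mean_load` and the parameter choices are hypothetical and are not taken from the paper. It draws an independent, uniformly distributed module for every request and reports how far the busiest module's load lies above the average load.

```python
import random
from collections import Counter


def max_to_mean_load(num_modules: int, num_requests: int, seed: int = 0) -> float:
    """Ratio of the busiest module's request count to the average count.

    Every request picks its module uniformly and independently of all
    other requests, which is what the standard assumptions amount to.
    """
    rng = random.Random(seed)
    hits = Counter(rng.randrange(num_modules) for _ in range(num_requests))
    mean_load = num_requests / num_modules
    return max(hits.values()) / mean_load


if __name__ == "__main__":
    # The ratio drifts toward 1 as the run gets longer.
    for num_requests in (160, 16_000, 1_600_000):
        print(num_requests, round(max_to_mean_load(16, num_requests), 3))
```

A ratio near 1 means that the traffic is spread almost evenly, so M modules deliver close to M times the bandwidth of one module; for short runs the ratio is noticeably larger.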


5. The Proposed Design 


5.1 Design Strategy


The design proposed in [1] is intended to 
allow the designer to bring the performance (effec- 
tive bandwidth) of a multiple-module memory system 
as close as desired to what it would be if the 
standard statistical assumptions held. 


In cases, then, in which the standard statistical
assumptions can be assumed to hold, the design
proposed in [1] would, as we will see, involve
an unnecessary additional expense. On the other
hand, it has great potential for:


1. cases in which neither can the standard 
assumptions be assumed to hold, nor are any other 
statistical assumptions regarding referencing be- 
havior known or obtainable, 

2. cases in which statistics regarding referencing
behavior, although known or obtainable,
cannot be made use of in system design (hardware)
or organization (software) - because actually obtaining
such statistics, or using them, would be
impractical or infeasible.


Given, then, that the cases in which the de- 
sign will be of interest are those for which no 
a priori statistical assumptions can be made re- 
garding the pattern of access requests entering 
the system, we will not be able, a priori, to view 
those requests as random variables with some known 
or knowable distributions and some known or know- 
able correlations to one another; rather we will 
view them as logical identifiers of items stored 
in memory and deliberately referenced by users 
(request-issuing devices) in whatever pattern 
suits the needs of those users' particular reasons 
for using the system.


5.2 Details of the Design 
The design itself consists of three points: 


1. Deliberate (uniform) random allocation of 
space to items when they are allocated space in 
the modules of memory - each item deliberately 
allocated space independently of all others. 


This would be implemented through the use of
either software, or, more probably, hardware generation
of pseudo-random numbers. (In cases in
which referencing behavior cannot a priori be 
assumed to be random (and independent in the way 
indicated), but in which allocation can, this step 
would, of course, be altogether unnecessary.) 


2. Distribution of multiple modules of a 
novel type of memory which we will call "repeti- 
tion filter memory" - RFM for short - over the 
internal components of the multistage interconnec- 
tion structure (see Figure 1) proposed in [1]. 

(The exact nature of RFM will be detailed in Sec- 
tion 6.) 

3. Increase in the number of ordinary memory
modules - hereafter referred to as "modules of
primary memory" - beyond the number which would be
required to produce the desired effective bandwidth
if memory conflict did not exist.

5.2.1 Consequences of the First Point 

The first design point ensures that even in 
the absence of any a priori assumptions regarding 
referencing behavior every access request travers- 
ing the interconnection structure is to a module 
of primary memory chosen at random from a uniform 
distribution over all M modules. It does not, of 
course, ensure that different access requests are 
to modules chosen independently of one another; 
indeed, different requests might deliberately be 
addressed to the very same item, and therefore the
very same module.


The matter of how logical identifiers are 
translated into physical addresses when memory has 
been deliberately allocated at random is, by the 
way, quite simple. Each request-issuing device 
will have the highest level of file directory (or 
translation table) stored locally; the rest of the 
file directory (or translation table) will be 
stored as items in modules of primary memory in 
exactly the same way as are any other items. References
to items of the former type will, then, be
handled in the same Stochastically Conflict-Free
manner as will references to any other items.
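
The first design point and the translation step can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the mechanism of [1]: the class name `RandomAllocator` and its methods are hypothetical, the directory is a single flat in-memory table rather than a hierarchy whose lower levels are themselves stored in primary memory, and a software pseudo-random generator stands in for the hardware generator the text considers more probable.

```python
import random
from typing import Dict, Optional


class RandomAllocator:
    """Allocates each item to a module chosen uniformly and independently."""

    def __init__(self, num_modules: int, seed: Optional[int] = None) -> None:
        if num_modules < 1:
            raise ValueError("need at least one module")
        self.num_modules = num_modules
        self._rng = random.Random(seed)
        # Flat translation table: logical identifier -> module address 1..M.
        self._directory: Dict[str, int] = {}

    def allocate(self, item_id: str) -> int:
        """Pick a module for a new item, independently of all earlier picks."""
        if item_id in self._directory:
            raise KeyError(f"{item_id!r} already has space allocated")
        module = self._rng.randint(1, self.num_modules)  # both ends inclusive
        self._directory[item_id] = module
        return module

    def locate(self, item_id: str) -> int:
        """Translate a logical identifier into its module address."""
        return self._directory[item_id]
```

Because the module is drawn when the item is allocated, two requests for the same identifier always translate to the same module; this is why random allocation alone does not make different requests independent of one another.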


5.2.2 Consequences of the Third Point 

The third design point, taken in conjunction 
with the first, ensures an increase in the expected 
uniformity of spread of request traffic over mod- 
ules of primary memory regardless of the pattern 
of logical identifiers entering the system - i.e., 
so long as not all these requests are to exactly 
one and the same item. 


5.2.3 Consequences of the Second Point 

The effect of the second design point, whose 
details we defer to Section 6, will be to enable 
the designer to ensure that repeated requests to 
any item, if they are closer together in time than 
some chosen distance (say a distance of C-1 intervening
requests, where C is itself a design parameter)
will result in all requests but the first
being serviced without actually being sent to the 
module of primary memory in which the item resides. 


This will, in turn, ensure, informally speaking,
that, as suggested in [1], from the point of
view of actual effective bandwidth no pattern of
entering access requests could be worse than one
of the form $b_1, b_2, b_3, \ldots, b_C, b_1, b_2, b_3, \ldots, b_C, \ldots$
(repeated indefinitely), where $i \ne j$ implies $b_i \ne b_j$.


The sense in which manipulation of the param- 
eter C allows the designer to bring the performance 
of a system as close as desired to what it would 
be if the standard statistical assumptions held 
is that the assumptions are essentially to the 
effect that there are no deliberate repetitions 
(i.e., that C is infinite).


The extent to which this informal argument 
translates into formal results is the subject of 
Section 7. 


6. Repetition Filter Memory 


In the design proposed in [1], M and N are 
assumed to be large enough so that a complex rout- 
ing network is required as the interconnection 
structure. As a further result of the assumed 
magnitudes of M and N the RFM has to be modularized
to be capable of operating sufficiently fast. 


In the present paper we will simplify matters 
by turning our attention to systems of the type 
depicted in Figure 2; i.e., we will assume that a 


single module of RFM is sufficient, and that the 
interconnection structure required is simple enough 
to be ignored. We will further assume that the 
RFM processes every request directed to it - i.e., 
every request entering the system in systems of 
the type depicted in Figure 2 - instantaneously. 


In the case to be considered here, i.e., the 
case of multiple module data-base memory systems, 
if the RFM is built of very fast technology and 
the modules of primary memory are disks, then this 
last assumption will, for all intents and purposes,
be fully justified for values of M and N as high 
as in the hundreds or even possibly in the thou- 
sands. For the case in which a complex intercon- 
nection structure is, on the other hand, required, 
the analysis to be presented in Section 7 can be 
taken as a partial analysis, concentrating on the 
"access bandwidth" of the system of primary memory 
modules rather than on the "communication band- 
width" of the interconnection structure or on the 
effectiveness of a multiple-module (distributed) 
RFM in filtering out repeated requests. 


6.1 Basic Mode of Operation 

The operation of RFM resembles, but is cer- 
tainly not identical to the operation of LRU cache 
memory. (We must, however, caution the reader that 
the reason for the introduction of this novel type 
of memory in [1] bears no resemblance whatsoever 
to the reason for which LRU cache is incorporated 
into conventional memory systems.) 


Let $a_1, a_2, a_3, \ldots$ be a sequence of access requests
entering the system; each $a_i$ is processed
by the RFM as follows:


i) If the item to which $a_i$ requests either
"read" or "write" access is already stored in the 
RFM then the request is serviced there instanta- 
neously. In the case of a write access this means 
that the new value to be taken by the item is re- 
corded into the RFM entry for the item, and an 
acknowledgment is sent to the request-issuing de- 
vice. In the case of a read access the response 
to the request-issuing device is a copy of the 
item. 


In neither case, of course, is the request 
sent along to the module of primary memory in which 
the item resides. 


ii) If $a_i$ is a write request and the item in
question is not stored in the RFM, then the request
is still serviced instantaneously in the RFM. In
this case service consists of the creation of an
entry for the item - the new value taken from the
write request itself - and the sending of an
acknowledgment to the request-issuing device.


Again, of course, the request is not sent
along to the module of primary memory in which the
item resides.


iii) In the remaining case, i.e., that of $a_i$
being a read request for an item not stored in the
RFM, a "pre-arrival" entry is created for the item;
such an entry consists of the address of the item,
and that of the request-issuing device. (The type
of entry referred to in i and ii above will be
termed a "post-arrival" entry.)


At some later point in time the item itself 
will reach the RFM - as a result of a response re- 
turning to the RFM from primary memory. At that 
time all requests which resulted in the creation 
of pre-arrival entries for the item will be in- 
stantaneously serviced. (In actuality they will 
be serviced by the fast RFM between the arrival of 
access requests to the much slower disk memory.) 
Exactly one post-arrival entry for the item will 
be created and will be the only entry for the item
retained in the RFM. 


In order to be sure that the item will, in 
fact, eventually reach the RFM from the module of 
primary memory in which it resides, exactly one of 
the requests which resulted in the creation of
pre-arrival entries for it will be sent on to primary
memory - viz. the earliest one. 


(N.B. as we have described the operation of RFM a 
read request might be responded to with the value 
of the item which was current at the time of the 
request rather than with the very latest value. 
The definition of the operation of RFM is easily 
modified if this is not desired.) 
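
Cases i to iii can be summarized in a short sketch. It is an illustration under stated assumptions rather than the design of [1]: capacity is unbounded (the replacement policy is ignored here), the class and method names are hypothetical, and the choice that a newer write is not overwritten when an older value arrives from primary memory is an assumption of the sketch.

```python
from typing import Any, Dict, List, Tuple


class RepetitionFilterSketch:
    """Unbounded-capacity model of the RFM request handling (cases i-iii)."""

    def __init__(self) -> None:
        self.post_arrival: Dict[int, Any] = {}        # address -> value
        self.pre_arrival: Dict[int, List[str]] = {}   # address -> waiting devices

    def request(self, kind: str, addr: int, device: str,
                value: Any = None) -> Tuple[str, Any]:
        """Return ("serviced", response), ("forwarded", addr) or ("waiting", None)."""
        if kind == "write":
            # Cases i and ii: a write is always absorbed by the RFM.
            self.post_arrival[addr] = value
            return ("serviced", "ack")
        if kind != "read":
            raise ValueError(f"unknown request kind: {kind!r}")
        if addr in self.post_arrival:
            # Case i: the item is present, so the read is answered here.
            return ("serviced", self.post_arrival[addr])
        # Case iii: record a pre-arrival entry; only the earliest read
        # for the item is sent on to primary memory.
        first = addr not in self.pre_arrival
        self.pre_arrival.setdefault(addr, []).append(device)
        return ("forwarded", addr) if first else ("waiting", None)

    def arrival(self, addr: int, value: Any) -> List[Tuple[str, Any]]:
        """The item returns from primary memory: answer every waiting read."""
        waiting = self.pre_arrival.pop(addr, [])
        # Assumption: a value written in the meantime is kept.
        self.post_arrival.setdefault(addr, value)
        return [(device, value) for device in waiting]
```

Waiting reads receive the value that arrives from primary memory, which matches the remark that a read may be answered with the value current at the time of the request rather than the very latest one.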

6.2 Replacement Policy 

The capacity of an RFM, i.e., the number of 
(pre- and/or post-arrival) entries it has space 
for is, of course, limited. Just as with an LRU 
cache, an RFM will, when it has to in order to 
make room for a new entry, throw away the least 
recently referenced entry it holds. In the case 
of RFM the definition of "recentness of use" is 
that pre-arrival entries are never considered to 
be "used" after they are created; i.e., they only 
age. Post-arrival entries on the other hand can 
be "refreshed" (vis-a-vis recentness of use)
exactly as are ordinary LRU cache entries. 


If a pre-arrival entry whose creating request
was not sent on to primary memory were ever simply
erased, the creating request would never be responded
to. "Throwing away" of such an entry thus
consists of not just erasing it, but also reconstituting
the request and sending it to the appropriate
module of primary memory.


Finally, a post-arrival entry which has been 
written into in the RFM is "written through" into 
primary memory if and when it is thrown away. 
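
The replacement rules can be sketched with an ordered dictionary kept in order of recentness of use. This is a hedged illustration only: the class `RfmReplacementSketch`, its method names, and the representation of one pre-arrival entry per waiting request are assumptions of the sketch, and the treatment of an evicted pre-arrival entry whose request was already forwarded (it is simply dropped) is not specified by the text.

```python
from collections import OrderedDict
from typing import Any, List, Tuple


class RfmReplacementSketch:
    """LRU-style replacement in which pre-arrival entries only age."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Oldest (least recently referenced) entry first.
        self.entries: "OrderedDict[Tuple, dict]" = OrderedDict()
        self.to_primary: List[Tuple] = []   # traffic sent to primary memory
        self._serial = 0

    def _make_room(self) -> None:
        while len(self.entries) >= self.capacity:
            _, entry = self.entries.popitem(last=False)
            if entry["kind"] == "pre" and not entry["forwarded"]:
                # Reconstitute the request so that it is eventually answered.
                self.to_primary.append(("read", entry["addr"], entry["device"]))
            elif entry["kind"] == "post" and entry["dirty"]:
                # Write a modified item through when it is thrown away.
                self.to_primary.append(("write", entry["addr"], entry["value"]))

    def add_pre_arrival(self, addr: int, device: str, forwarded: bool) -> None:
        self._make_room()
        self._serial += 1
        self.entries[("pre", addr, self._serial)] = {
            "kind": "pre", "addr": addr, "device": device, "forwarded": forwarded}

    def put_post_arrival(self, addr: int, value: Any, dirty: bool) -> None:
        key = ("post", addr)
        if key in self.entries:
            dirty = dirty or self.entries.pop(key)["dirty"]
        else:
            self._make_room()
        self.entries[key] = {"kind": "post", "addr": addr,
                             "value": value, "dirty": dirty}

    def touch_post_arrival(self, addr: int) -> Any:
        """A hit refreshes a post-arrival entry; pre-arrival entries never refresh."""
        key = ("post", addr)
        self.entries.move_to_end(key)
        return self.entries[key]["value"]
```

Only evictions generate traffic to primary memory in this sketch, so `to_primary` records exactly the reconstituted reads and the write-throughs described above.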


6.3 Effect of RFM 


We will assume, without loss of generality, 
that although access requests may enter a system 
from any of the N ports, no two requests enter at 
exactly the same time; i.e., the sequence $a_1, a_2, a_3, \ldots$
represents the system's input in temporal order. 
We further introduce the following notation: 


1. The M modules of primary memory will be 
given addresses of 1,...,M. 

2. $y_i$ will be used to denote the address of
the module of primary memory to which a service
request is sent as a result of inputting $a_i$; the
notation $y_i = 0$ will be used to indicate that
inputting $a_i$ causes no request to be sent to any
module of primary memory - as a result of the
filtering effect of the RFM.

3. $Y_i$ will be used to denote the set
$\{y_i, y_{i+1}, \ldots, y_{i+C-1}\}$.

We will assume for the sake of simplicity
that no pre-arrival entry ever has to be thrown
away. (A pre-arrival entry has to be thrown away
only in the very low probability case that its
creating request landed in a very-much-higher-
than-average memory module queue.) The effect of
the introduction of an RFM of capacity C = mM
entries (m an integer) taken together with the
consequences of the first design point is then
that without any a priori statistical assumptions
regarding the sequence $a_1, a_2, a_3, \ldots$, for every
$i \ge 1$, if $y, z \in Y_i$, then either


1. $y = 0$ or $z = 0$ or both, or


2. $y$ and $z$ are random variables uniformly
distributed over $\{1, \ldots, M\}$ and are independent
of one another.


7. Performance Analysis 

We propose to analyze the performance of the 
type of system under consideration over finite 
periods of operation. The analysis will, thus, 
include a precise indication of the length of time 
a system will have to be in operation for the predicted
performance to be achieved.


We will assume that the system is to be run
at a rate of $\beta = \eta M$ input requests per unit of
time, where $\eta < 1$ and the unit of time is the
amount of time required by a single module of
memory to service one request and be prepared to
receive another. We will further assume that the
system is to be run for at least the amount of
time required to input C = mM requests. Our results
will apply as long as the system is in operation
for at least this length of time.


Suppose, then, that for some $\ell$ we run the
system for a sequence of $\ell C = \ell mM$ requests (for
$\ell m/\eta > \ell m$ units of time) where $\ell \ge 1$ is, for the
sake of simplicity, taken to be an integer.


Let:


1. $t = 1$ be the time at which the first of
the $\ell C$ requests enters the system.

2. $t = \tau = \ell m/\eta$ be the time at which the
last of the $\ell C$ requests enters the system.

3. $t = \tau'$ be a random variable representing
the time at which the request which is serviced
last has just been serviced.

4. $\beta'$ be a random variable which represents
the actual rate at which the system responds to
the $\ell C$ requests, i.e., let $\beta' = (\tau/\tau')\beta$.

We will pose the following question about the 
performance of our data-base mode memory system: 


Given any $\theta < 1$, what is the probability
$P(\theta, \beta, \ell)$ that when the system
is run for $\ell C$ or more requests, it
will fail to respond at an actual
rate $\beta'$ of $\theta\beta$ or greater?


$P(\theta, \beta, \ell)$ can thus be thought of as the failure
probability, i.e., the probability that the system
will fail to perform at a level greater than or
equal to that specified by $\theta$, $\beta$, and $\ell$.


For the purpose of our analysis we will be
concerned with sequences of random variables
$X_1^h, X_2^h, \ldots, X_{\ell C}^h$, where $X_i^h$,
$1 \le i \le \ell C$, represents the
contribution of $a_i$ to the number of access requests
arriving at the h-th memory module,
$1 \le h \le M$, as a result of inputting $a_1, a_2, \ldots, a_{\ell C}$.

If we restrict our attention to $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$
for some j, $j = 0, C, 2C, \ldots, (\ell-1)C$, and if no two
of $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$ are identical, then the
$X_{j+i}^h$, $1 \le i \le C$, are independent and identically
distributed as follows:

$$\mathrm{prob}\{X_{j+i}^h = 1\} = 1/M$$

$$\mathrm{prob}\{X_{j+i}^h = 0\} = 1 - 1/M$$

If, on the other hand, two of $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$
are identical, say $a_{j+s} = a_{j+t}$ where
$1 \le t < s \le C$, then $X_{j+s}^h$ is identically zero; this
is the case because the request for $a_{j+s}$ will never
reach the memory module in which $a_{j+s}$ resides;
rather, it will be filtered out of the input
stream by the RFM.


Still restricting our attention to $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$
for some particular j, $j = 0, C, 2C, \ldots, (\ell-1)C$,
we note that the expected number of hits
on the h-th memory module, $1 \le h \le M$, i.e., the
expectation of $H_j^h = X_{j+1}^h + X_{j+2}^h + \cdots + X_{j+C}^h$,
which we will denote by $E(H_j^h)$, is less than or
equal to C/M = mM/M = m, reaching its maximum
value if and only if no two of $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$
are identical. (Note that in this case $H_j^h$ is
binomially distributed with parameters C and 1/M.)


Moreover, at the assumed rate of $\beta = \eta M$ requests
per unit of time, the sequence of requests
$a_{j+1}, a_{j+2}, \ldots, a_{j+C}$ is input to the system in
$mM/\eta M = m/\eta > m$ units of time - an amount of time
in which a single memory module can service
$m/\eta > m$ requests.


We define, for each h, $1 \le h \le M$, a random
variable $S_j^h$ to represent the number of access requests
(resulting from inputting $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$)
reaching the h-th memory module in excess of
the number of requests that it can service in the
time required to input C requests into the system.
Formally, if we let $\alpha = 1/\eta > 1$, then the distribution
of $S_j^h$ is as follows:

$$\mathrm{prob}\{S_j^h = 0\} = \mathrm{prob}\{H_j^h \le \alpha m\}$$

$$\mathrm{prob}\{S_j^h = k\} = \mathrm{prob}\{H_j^h = \alpha m + k\} \quad \text{for } k > 0$$
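
The distribution just defined can be tabulated directly from the binomial law of the hit count. The sketch below is only illustrative: it assumes the extreme case in which no two requests of the period are identical (so that the hit count is binomial with parameters C = mM and 1/M), it assumes that the per-period service capacity (the quantity written as alpha times m in the text) is an integer passed in as `capacity`, and the function names are hypothetical. Exact integer binomial coefficients are used, which is adequate for small parameters only.

```python
from math import comb
from typing import List


def binomial_pmf(k: int, n: int, p: float) -> float:
    """prob{H = k} for H binomially distributed with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def excess_pmf(M: int, m: int, capacity: int) -> List[float]:
    """pmf of the excess S at one module over a period of C = m*M requests.

    Entry k of the result is prob{S = k}; capacity plays the role of
    alpha*m and is assumed to be an integer.
    """
    C = m * M
    hits = [binomial_pmf(k, C, 1 / M) for k in range(C + 1)]
    pmf = [sum(hits[: capacity + 1])]   # prob{S = 0} = prob{H <= capacity}
    pmf.extend(hits[capacity + 1:])     # prob{S = k} = prob{H = capacity + k}
    return pmf


def expected_excess(pmf: List[float]) -> float:
    """E(S) computed from the tabulated distribution."""
    return sum(k * p for k, p in enumerate(pmf))


if __name__ == "__main__":
    pmf = excess_pmf(M=8, m=4, capacity=5)
    print("prob{S = 0} =", round(pmf[0], 4))
    print("E(S)        =", round(expected_excess(pmf), 4))
```

Multiplying an entry of this table by M gives the union-type bound on the corresponding probability for the most heavily hit module.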


Note that $S_j^h$ is not the number of requests
remaining in the queue in front of the h-th memory
module just after $a_{j+1}, a_{j+2}, \ldots, a_{j+C}$ have been
input - not even for j = 0 - but is rather the
total number of requests which have arrived at the
h-th memory module during the period of input of
$a_{j+1}, a_{j+2}, \ldots, a_{j+C}$; we have, as yet, said nothing
about when the requests that arrive at a module
actually arrive.


Finally, we define for each $j$, $j = 0, C, 2C, \ldots, (\ell-1)C$,
a random variable

$$S_j^* = \max_{1 \le h \le M} (S_j^h)$$

i.e., the "excess" at the "most heavily hit" memory
module. It is easy to see that

$$\mathrm{prob}\{S_j^* = k\} \le M \, \mathrm{prob}\{S_j^h = k\} \qquad (1)$$
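
The definitions leading up to (1) can be checked numerically for small parameters. The following is a minimal sketch, not part of the original analysis: it assumes that no two addresses in a period coincide (so that the hit count is binomial with parameters C = mM and 1/M) and that the product of alpha and m is an integer. The function names are illustrative only.

```python
from math import comb


def hit_pmf(C, M):
    """pmf of H ~ Binomial(C, 1/M): hits on one module during one period."""
    p = 1.0 / M
    return [comb(C, x) * p**x * (1 - p) ** (C - x) for x in range(C + 1)]


def excess_pmf(m, M, alpha):
    """pmf of the excess S = max(0, H - alpha*m) at a single module.

    Assumes alpha*m is an integer (an assumption of this sketch).
    """
    C = m * M
    cap = round(alpha * m)          # requests one module can service per period
    h = hit_pmf(C, M)
    s = [0.0] * (C - cap + 1)
    s[0] = sum(h[: cap + 1])        # prob{S = 0} = prob{H <= alpha*m}
    for k in range(1, C - cap + 1):
        s[k] = h[cap + k]           # prob{S = k} = prob{H = alpha*m + k}
    return s


def union_bound_on_max(m, M, alpha):
    """Right-hand side of (1) for k = 1, 2, ...: M * prob{S = k}, capped at 1."""
    s = excess_pmf(m, M, alpha)
    return [min(1.0, M * q) for q in s[1:]]
```

For instance, `excess_pmf(6, 4, 1.5)` describes a toy system with 4 modules, m = 6, and a per-period service capacity of 9 requests per module.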
In what follows we will need an approximation
of $E(S_j^*)$. Using (1) as well as a) the standard
derivation of the mean deviation for the binomial
distribution ([4], pp. 176-177) and b) Stirling's
approximation ([5], p. 172), it is possible to show


| } M 


_(o-l-aln o)m Ll (2) 


E(s*) < mL 
J TO 
We turn now to the question of the number of
requests remaining in the highest queue. Let the
time line of Figure 3 represent the $\ell$ periods,
each of $C$ input requests (each of duration $\alpha m$
units of time), with which we are concerned.


Precise results regarding the distribution of
$\tau' - \tau$, which is the number of requests remaining
in the highest memory module queue at time $t = T$,
involve not only the excess numbers of arrivals to
the various modules during the $\ell$ periods, but also
the precise times of arrival. Since such results
appear to be difficult to obtain, we will content
ourselves with considering the maximum value that
$\tau' - \tau$ could possibly attain for any given value of
$S_0^* + S_C^* + \cdots + S_{(\ell-1)C}^*$.

We will, in effect, assume arrival times and
identities of "most heavily hit" memory modules
which, for any given value of
$S_0^* + S_C^* + \cdots + S_{(\ell-1)C}^*$, will leave the greatest
possible number of unserviced requests in the
highest memory module queue at time $t = T$.


To wit, we will assume the following: 


1. There is a memory module which is a (the)
most heavily hit module for all $\ell$ periods of input,
and this module receives a nonzero excess of
requests during each period.

2. However many access requests arrive at
the most heavily hit memory module during each
period, they all arrive at the very end of the
period - i.e., too late for any of them to be
serviced during that period.


It should be clear that however many access
requests arrive at the various memory modules over
the $\ell$ periods of interest, $\tau' - \tau$ is maximum under
assumptions 1 and 2 above.


But under these assumptions we have that

$$\tau' - \tau = \alpha m + S_0^* + S_C^* + \cdots + S_{(\ell-1)C}^*$$

i.e., no requests are serviced during the first
period, the first $\alpha m$ requests arriving during the
first period are all serviced during the second
period, and during each subsequent period exactly
$\alpha m$ requests are serviced. The number of requests
which then remain in the queue in front of the most
heavily hit memory module is the sum of all the
excesses for the $\ell$ periods plus the first $\alpha m$
requests which arrived during the first period, but
were not serviced then.
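
The accounting in this worst case can be replayed period by period. The sketch below is an illustration under assumptions 1 and 2 only; the function name and the integer-valued inputs are hypothetical and not from the paper. In each period the module serves at most alpha*m queued requests, and that period's arrivals (alpha*m plus the excess) land only at the very end.

```python
def worst_case_backlog(alpha_m, excesses):
    """Queue length at the most heavily hit module after all periods.

    alpha_m  -- requests one module can service per period (alpha * m)
    excesses -- the per-period excesses S*_0, S*_C, ..., one per period
    """
    queue = 0
    for s in excesses:
        served = min(queue, alpha_m)  # only requests already queued can be served
        queue -= served
        queue += alpha_m + s          # this period's arrivals come in at its end
    return queue
```

With `alpha_m = 9` and excesses `[2, 1, 3]`, the loop leaves 9 + 2 + 1 + 3 = 15 requests queued, which matches the closed form of alpha*m plus the sum of the excesses.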


Thus we have 


2 
P(6,6,2) < probj =» ________ , 


* * * 


| a Qm 
= prob . a <@} (3) 
: a” * 

[ HAL) mts HS ct.. +8 0016 


( * ae * 2 Q 
= probiS +5 +...+ 5 akm=o (241) me 
Prop Soc eS (faaiye “8 


(Note that from the second line of (3) we can see
that, according to our pessimistic approximation,
we cannot hope to achieve an effective service
rate of $\delta\beta$ until $\ell/(\ell+1) > \delta$ - that is, even
if $S_0^* + S_C^* + \cdots + S_{(\ell-1)C}^* = 0$.)
Now, for any k > 0 (see [5], p. 242). 


YS PS oP S. oe 
probiSy+8o +--+ +80 1)¢ 


E(Sg+Sqt-+-+8ry aq 
: : -~1)¢ 
te 4 . 
2E(S5) ” 
~  k 
| oy oo 
: 2 Vn _(a-l-alna)m ef ~¥) +5 
ky /2n0 °€ 


Finally, from (3) and (4) we have 


a = @; 


| a a +3 
YoM  (a-l-alno)m [M|"™~M} Ty 
MMe calnM \ 
P(8,B2) < Jana” Seas 


) 


ee | (5) 
n- EEL) ng | 


Table 1 gives values of $\alpha - 1 - \alpha \ln \alpha$ for some
possible values of $\alpha$.


   $\alpha$     $\alpha - 1 - \alpha \ln \alpha$
   1.1     -.0048412
   1.2     -.01878587
   1.3     -.04107354
   1.4     -.07106113
   1.5     -.10819766
   1.6     -.15200581
   1.7     -.20206803
   1.8     -.2580160
   1.9     -.31952238
   2.0     -.38629436


Table 1
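
The entries of Table 1 follow directly from the expression being tabulated. As a quick check, the short sketch below recomputes them; the function name `exponent` is ours, and the range 1.1 through 2.0 in steps of 0.1 is the one that reproduces the tabulated values.

```python
from math import log


def exponent(alpha):
    """Coefficient of m in the exponential factor: alpha - 1 - alpha*ln(alpha)."""
    return alpha - 1 - alpha * log(alpha)


if __name__ == "__main__":
    for tenths in range(11, 21):          # alpha = 1.1, 1.2, ..., 2.0
        alpha = tenths / 10
        print(f"{alpha:.1f}   {exponent(alpha):.8f}")
```

The coefficient is negative for every value of alpha above 1, which is what makes the exponential factor shrink as m grows.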


The values in the right-hand column of Table 1
indicate that for $\alpha$ in the range under consideration
the factor $e^{(\alpha - 1 - \alpha \ln \alpha)m}$ in our very
pessimistic overapproximation of $P(\delta,\beta,\ell)$ decreases
exponentially as $m$ is increased. (Note that for
$\alpha = e$, $e^{(\alpha - 1 - \alpha \ln \alpha)m} = e^{-m}$, and for
$\alpha = ae$, $a > 1$, $e^{(\alpha - 1 - \alpha \ln \alpha)m} = e^{-bm}$
where $b > 1$.)


8. A Numerical Example 


Consider the design of a fairly large system, 
i.e., one which we wish to drive at a peak rate of 
1000 requests per unit time. Suppose that we de- 
sire an effective service rate of .99 or more of 
the constant (peak) input rate after 100 request 
cycles, and that we are willing to sustain the 
cost of 1500 modules of memory to accomplish this. 
(In an actual design study, of course, $\alpha$ need not
be chosen in advance; rather, $\alpha$ and $m$ can be traded
off against one another on the basis of the
incremental cost of memory modules and the incremental
cost of RFM capacity.)


In the present example we have $M = 1500$,
$\alpha = 1.5$, $\delta = .99$, and $\ell = 100$. A few quick
computations using (5) reveal that the following failure
probabilities can be achieved with the indicated
amounts, $m$, of RFM capacity per module of primary
memory:


m      $P(\delta,\beta,\ell)$
1000 2.8694 x 10°" 

300 7.6093 x 107” 

250 4.6847 x 10” 
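
The parameters of this example can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it derives alpha from the stated peak rate and module count, checks the necessary condition that l/(l+1) exceed delta noted after (3), and evaluates only the exponential factor of the bound, not the full expression (5). Variable names are ours.

```python
from math import exp, log

M = 1500          # memory modules
beta = 1000.0     # peak request rate per unit time
delta = 0.99      # desired fraction of the peak rate
ell = 100         # number of request cycles (periods)

alpha = M / beta  # alpha = 1/eta with beta = eta * M


def decay_factor(m):
    """Exponential factor exp((alpha - 1 - alpha*ln(alpha)) * m) only."""
    return exp((alpha - 1 - alpha * log(alpha)) * m)


if __name__ == "__main__":
    print("alpha =", alpha)
    print("l/(l+1) > delta:", ell / (ell + 1) > delta)
    for m in (1000, 300, 250):
        print(m, decay_factor(m))
```

The remaining factors in (5) multiply this quantity, so the printed numbers indicate how fast the bound falls with m rather than the failure probabilities themselves.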
You might not be surprised to hear that scientific
processors will be successful products, but
the question is will they be successful enough? 
Development money is costly, engineers are in 
short supply and general commercial products are 
selling well. 


As manufacturers we capitalize on experience, 
technology, manufacturability, human resources, 
service operations, investment in software and 
the general company processes that we know and 
understand. This assures a smooth running oper- 
ation, a reasonable return on investment and 
continued growth for our users. But, what about 
attempting to develop an entirely new business 
area where it may require deviation from the 
norm? Many decisions concerning new business 
ventures are made by relatively uninformed prod- 
uct planners and their technical staffs. These 
decisions affect marketability, the market as a 
whole, thousands of people (users and vendors 
alike), and the expenditure of millions of dol- 
lars. 


Now let's explore the character of this problem 
and attempt to answer some questions: What does 
"relatively uninformed" mean? What can we do to 
improve this process? What can researchers do 
to help? 


Guiding forces of scientific data processing in 
the past were Federal Government agencies, such 
as NASA in orbit and reentry dynamics and structural
analysis, the Air Force in wind tunnel
simulation, and the Atomic Energy Commission in 
nuclear modelling. The demand for better model- 
ling in energy conservation, the demand for bet- 
ter weather prediction, and the demand for ad- 
vances in technology assure "super computers" of 
a continuously growing market. Today, a new 
swell of attached scientific processor hardware 
design activity is emerging out of the need for 
high performance scientific processing at an ex- 
tremely low cost. Some single product vendors 
are reaping the benefits from this economical 
hardware. Even so, some of this market is not 
fully satisfied by these vendor products due to 
lack of software, support and capabilities for a 
total systems approach. 


Sperry Univac has been known for years as an in- 
dustry leader in scientific computing, and as 
with any aggressive company, will continue to ex- 
ploit this market. The logical question is, just 
what is necessary to satisfy industry's appetite 
in this fast growing and changing market? Great- 
er performance, to be sure; lower cost per compu- 
tation...yes, definitely; but industry users are 
changing in other dimensions as well. Among
traditional high performance scientific demands,
users expect three primary things:


- ease of use, the ability of a non-computer
scientist to use the system;


- program compatibility, the ability to capitalize
on millions of dollars invested
through program development;


- and high total system availability to prevent
the loss of a half finished job.


Yes, the climate has changed; entry into this
market today does not have to be a head-on collision
with the "super computer" manufacturers. It
could fill an un-filled market requirement with a 
more traditional entry for the well established 
vendors. 


The question remains, what other critical ingre- 
dients are necessary to lessen risks for large 
computer manufacturers and provide motivation to 
launch into new market areas such as scientific 
vector processing? There are no simple or pat 
answers, but careful and cautious strategic and 
technical planning can minimize the investment 
risk, establish the proper place in the market, 
and prevent false starts in implementation. Even
with a carefully planned strategy, most new ven- 
tures are doomed to disaster. Is it any wonder 
management is hesitant to enter into speculative 
markets? 


We as product planners search for better mechan- 
isms to provide a broader spectrum of feasibility 
assurance. What we worry about is the lack of 
knowledge and experience that could lead to mar- 
ginal gut-level trade-off decisions in design 
that might cause the demise of the product. 


In-house technical and feasibility studies, re- 
search and academic papers, and consultants are, 
to be sure, heavy contributors to support this 
critical decision-making process. Even with all 
this design resource applied at the definition 
point in the product development process, they 
cannot provide all the facts necessary for a 
proper trade-off decision. By default then, many 
decisions that should be made on a clearly tech- 
nical basis get supplemented by special consider- 
ations of strategy or policy. To illustrate 
this, if we use CRAY-1 as an experience base, 
this would imply that a free-standing executive 
system (memory manager at least) should be used 
in new or competitive designs. The question is 
then, was this successful product a result of a
strategic decision by CRAY, or did it come from a 
good solid technical base? Another alternative 
is, if a vendor has general purpose host 


capabilities, strategically it would be wise from 
a development, support, and system configurabili- 
ty point-of-view to assign system management 
functions or executive functions to the host. To 
illustrate this, if we use 1100 Systems as an ex- 
perience base, the technical rationale for a sep- 
arate or free-standing executive gives way to the 
strategic one that says: Sperry Univac is in the 
business and has experience and know-how for 
multiprocessor systems. This, therefore, means 
one executive (one master) in the host that man- 
ages and schedules all resources, host processors, 
all peripherals, real memory and I/O traffic. 
Studies and experience thus far indicate this is 
viable, but without peripherals (disks) directly 
on the compute engine, how much peripheral and 
input/output traffic can the system stand without 
being brought to its knees? 


There are still other ways to view these system 
design decisions. User communities, it seems, 
are beginning to demand an integrated system or 
multiprocessor approach. Even those that once 
demanded the specialized compute engine now are 
demanding this compute power together with all 
features and functionality afforded the general 
purpose commercial user. They want the benefits 
offered by a system with a full-blown operating 
system, tried and proven (stable) executive, 
FORTRAN/COBOL compilers, data management facili- 
ties, interactive capabilities, etc. With this 
they get 10-20 years of system experience that 
translate to availability. Conclusion? The 
tightly coupled (multiprocessor) approach is a 
good decision in terms of marketability. But 
what about performance penalties in living with 
constraints of coexisting with other processors 
in a multiprocessor system? The question of het- 
erogeneous processor accesses to multiple memory 
modules contains many unknowns and demands much 
study. This challenge is typical of the trade- 
off facing the product developer. 


Again, reflecting on some architectural basics
brings to mind another critical area: getting
data to and from processors. The question is,
which is the best approach: register-to-register,
memory-to-memory, or some other?
Burroughs and CRAY differ in architectural con- 
cept; it would be beneficial to have some dia- 
logue on those differences. Where are the re- 
search studies to support that decision process? 
It may even be interesting to perform them after 
the fact. This is another example of a situation 
where scientists and implementers could maintain 
a close correspondence. 


By necessity, detailed studies on a particular 
idea, design approach or concept, result in anal- 
yses that show performance in a narrow spectrum. 
For example: a certain method of combining 1024 
processors will neatly handle partial differen- 
tial equations for a wind tunnel problem. Indus- 
try then attempts to interpret, extrapolate, and 
guess how this problem would map on an architec- 
ture that covers a wider spectrum of applications. 
By necessity, industry must cover a wide spectrum 
of applications to increase quantities, amortize 


cost, and in short, make a profit. Vendors do 

not have time, money and resources to verify in 
detail those guesses made in an attempt to make 
the product more palatable to a general market. 


True, industry has millions to invest in new 


products, but budgets are always strained...there 
is no luxury here, either. 


The bottom line is that normal processes between 
researchers and industry vendors do work well 
much of the time. Researchers concentrate in
their specialty areas, while product designers 
are responsible for evaluating, interpreting, and selecting
results that apply best to them. What is
needed is to improve upon this process. Gaps
between research and industry are too wide. Are 
there other steps that may be taken? Is another 
level of iteration possible wherein a decision 
regarding product posture can be fed back to 
research in order to further check validity? It 
seems that if one facet of a major venture could 


be better focused and coordinated, the energy 


expended in developing better communication be- 
tween implementers and scientists would be well 
worth the labor and trouble. 
GENERAL PURPOSE SUPERCOMPUTERS


Burton J. 


Denelcor, 


Denver, Colorado 80205 


Summary 


There are two properties that are shared by
all supercomputers, namely, they are parallel and
fast. Unfortunately, these may be the only two
properties that supercomputers have in common.
There are three additional properties that are
necessary (although perhaps not sufficient) for a
supercomputer to be "general purpose". These
properties will be desirable for some super- 
computer users and irrelevant for others, just as 
the general purpose attributes of a more 
classical computer system are. 


First, a general purpose supercomputer should 


be reasonably fast in its execution of any 
algorithm that performs well on another machine. 
The intent of this requirement is that any kind 
of parallelism should be exploitable. Second, 

a general purpose supercomputer should provide 
a machine-independent programming environment; 


that is, software should be no harder to transport 


from a given computer system to a general purpose
supercomputer than to an ordinary general purpose
computer. Third, a general purpose supercomputer
should have storage hierarchy performance
consistent with its computational capabilities.
Such a computer should not be I/O bound to any
greater extent than an ordinary general purpose
computer is for a given class of problem.


These three requirements are more than just 
a short list implying what is wrong with today's 
supercomputers. They are the principal reasons 
for the schism between the parallel processing 


business and the mainstream of computing practice. 


A supercomputer that satisfies these three 
requirements could enjoy a market several orders 
of magnitude larger than the current models do. 
While there will always be a need for special 
purpose parallel processors of all sizes and 
capabilities, it will be the general purpose 
super or not-so-super computers that dominate
the marketplace. 


If general purpose supercomputers are to
become a reality, substantial progress is required
in three areas. First, MIMD and data-driven
architectures offer the best hope for
exploiting many kinds of parallelism, but these
architectures have few proponents outside the
academic community. In fact, no MIMD or data-driven
supercomputer has ever been delivered
to a customer for trial. This situation will
improve in the next few years, but until it does
the design of these kinds of computers will not
be able to benefit from experience with practical
applications.

The second area in which progress is required
is that of machine-independent programming.
Two approaches are needed here:
parallelizing compilers for existing languages, 
and new languages in which parallelism is more 
easily detectable. Although there is some 
experience in automatic vectorization of 
FORTRAN, for example, it is only recently that 
attempts have been made to find and exploit 
other kinds of parallelism in existing languages. 
It also seems clear that languages like FORTRAN 
and COBOL will be in use for a long time to 
come and will therefore need to perform well on 
general purpose supercomputers. On the other 
hand, the advantages to be gained in parallelism 
by being able to express algorithms in functional 
programming languages must not be discounted; 
it is with these languages that the future of 
very high speed general purpose computing lies. 


Finally, the single most important technological
factor in general purpose supercomputer
development is mass storage bandwidth. The
failure of mass storage access times and data
rates to keep pace with the speed increases
realized by solid state technology is well
known. The effect of this deficiency has been
to severely constrain the range of application
of very high speed computers. A modest increase
in mass storage bandwidth would have
far more impact than more substantial advances
in device speed or packaging density. In fact,
only an improvement in interconnection
technology would have as much impact on
computation in general.
HIERARCHICAL ANALYSIS OF A DISTRIBUTED EVALUATOR

Gary Lindstrom

ABSTRACT

We outline the analysis of a distributed
evaluator for an applicative language FGL
(Function Graph Language). Our goal is to show
that the least fixed point semantics of FGL are
faithfully implemented by the hardware evaluator
envisioned in the Applicative Multi-Processor
System AMPS. Included in the analysis are a
formalization of demand-driven computation, the
introduction of an intermediate graphic language
IGL to aid in our proofs, and discussion of
pragmatic issues involved in the AMPS machine
language design.


INTRODUCTION


Programming languages for distributed computing
systems are currently receiving increased attention,
as are languages based on function
application. Distributed systems are of interest
because of a desire to exploit potential
concurrency in programs. Applicative languages
tend to reveal potential concurrency by
eliminating arbitrary sequencing within program
representations, and by circumscribing
side-effects. In addition, applicative languages
often allow programs to be written so that their
text closely resembles that of a correctness
specification, thereby easing verification.


Although the idea of using applicative languages
as a basis for concurrent programming has come 
into vogue only recently, the reader should refer 
to the prophetic paper [1] for an anticipation of 
many of the relevant ideas currently being put 
forth. Subsequent proposals, which share some 
aspects of our own, include [2] through [7]. 


Sketched herein is an analysis (i.e. an informal 
correctness proof) of an evaluator for an 
applicative language suitable for exploiting the 
features of a distributed computing system. This 
evaluator has been proposed for use in the 
Applicative Multi-Processing System AMPS [8].
Such a proof would be of interest for several
reasons:

1. The evaluator has been implemented ([9]), so
it is desirable to certify its correctness.

2. Although parts of similar proofs have been
sketched, notably in [10] and [11], these
proofs have been for serial evaluators, and
are for models having fewer machine-level
details than the one presented here.

3. A graphical approach to semantics seems to
us to be quite enlightening in comparison to
the one-dimensional representations largely
used heretofore.

4. We intend the present exposition as the
first step toward a more comprehensive proof
which also involves a storage manager.


The extrinsic language, called FGL (for Function
Graph Language), includes features deemed
relevant to highly concurrent distributed
evaluation. The hardware implementation we
consider includes a demand-driven data-flow
evaluator for effective support of the data
structuring primitives of our language. The
implementation naturally provides for single
evaluation of common subexpressions and
parameters.


Locality considerations give rise to a two-level
evaluation strategy for the machine language
(ML): at the intra-processor level, a rather
rigid structure is imposed, in which each atomic
function is executed with bounded value fan-out
and communication delay for greatest efficiency.
At the inter-processor level it is infeasible to
place such a bound, as one function may well have
to send its result to others, the number and
locations of which are not determinable a priori.
Block storage allocation is used in ML for the
following reasons:

1. It enforces locality of communication among
nodes which are logically closely related.

2. It permits economical use of address bits by
requiring only relative addresses within a
block.

3. It avoids the need for code relocation and
extensive dynamic binding.

4. Tuples of data values are stored as blocks,
or pieces of blocks, permitting fast
indexing.

5. Fewer interactions with the storage
allocator are required.

6. Blocks may be transmitted and initialized in
a "burst mode" of communication, rather than
in a word-by-word mode.


The proof that the distributed evaluator is
correct with respect to FGL's fixed-point
semantics is complicated by the two-level
block-oriented strategy. For this reason, we
have found it convenient to introduce a language 
IGL intermediate between the extrinsic language 
and that of the target machine. This language 
allows the analysis to be naturally decomposed 
into two levels (not corresponding to the levels 
of evaluation), but does not appear explicitly in 
the implementation. 


We express the notion of demand and value flow in
IGL programs as a state-transition system (cf.
[12]). The states are marked IGL graphs, with
transitions expressed by a set of formal rules.
This system is the basis for the FGL evaluator.


The notion of the correctness of such an
evaluator with respect to FGL semantics is
presented. We then discuss the proof of
correctness of the IGL evaluator with respect to
ML.

The following diagram summarizes the levels of
the hierarchy and their functions.

Acronym   Name                          Purpose

FGL       Function Graph Language       Programming
IGL       Intermediate Graph Language   Internal program representation for FGL
ML        Machine Language              Physical program execution

The analysis may be outlined as follows:

1. IGL-->FGL mapping theorem: IGL defines the
correct result for FGL.

2. IGL partial correctness theorem: IGL can
produce the correct result.

3. ML-->IGL mapping theorem: ML defines the
correct result for IGL.

4. IGL finite delay: ML provides a finite
delay property for IGL, so that "can" above
becomes "will".

5. Pragmatic aspects: Certain invariants
desired for implementation reasons hold for
ML executions.

FUNCTION GRAPH LANGUAGE

Our extrinsic language, FGL (Function Graph
Language), is Lisp-based [13], extended to include
non-strict atomic and programmer-defined
functions. This permits ease in dealing
semantically and pragmatically with unbounded
data structures, as discussed in [6] and
elsewhere. The components of such structures may
be distributed among physical processing elements
and concurrently constructed and transmuted,
using stream-like communication between computing
modules which are both physically and logically
distributed. Because of the functional nature of
FGL, logical aspects of the computation are
insensitive to delay in and among physical
elements.
The objects supported are not restricted to
streams of simple components, such as characters
or records, but also permit components which are
functions, other streams, and generally arbitrary
data objects. FGL allows treatment of functional
objects with full lambda-calculus generality
[14].
The cons operator of FGL permits an arbitrary
number of arguments, thus providing an efficient
and natural array capability. The usual car, cdr
selectors are generalized to an indexing selector
select. For simplicity, however, we will
primarily use car and cdr here; car selects the
first component of a tuple and cdr selects the
last. Other aspects of our generalization are
discussed in [15].


In this presentation, the set of data objects of
FGL will be

$$\mathit{Objects} = \mathit{Atoms} \cup \mathit{Tuples} \cup \mathit{Graphs} \cup \{\mathit{error}\} \cup \{?\}$$

where

1. Atoms = Integers $\cup$ Characters $\cup$ {NIL}, where
Integers is the set of integers and
Characters is the set of characters of some
alphabet. We assume that NIL plays the role
of the Boolean value false. Any atom other
than NIL and error may play the role of the
Boolean value true.

2. Tuples: A tuple is a sequence of N Objects,
for N an arbitrary natural number.
The limit of a sequence (i.e. "tree") of
nested tuples of objects, as nesting occurs
ad infinitum, is an object. For example,
the stream of odd prime numbers could be
represented as

(3, (5, (7, (11, (13, ...)))))

3. Graphs: We allow the enveloping of a graph,
as described in [16], and its use as a
function data object (i.e. as a "closure").

4. error: an error value which propagates
itself through each function which demands
it as an argument.

5. ?: the undefined object, i.e. the result of
a computation which has not yet (and might
never) produced any value.


A fully operational system might include
side-effect operators, but we prefer introducing
them within the context of an applicative style,
in which the programmer is highly aware of their
use (i.e. their use will be permitted only on
tuples which are created as explicitly
modifiable). Side-effect operators are not
included in the model presented here, with the
exception of read and print, which are described
subsequently.


For the purposes of this exposition, a program in
FGL appears as either a "function graph" or as a
"set of equations" [22]. Each equation is
determined by naming a FUNCTION being defined,
which has zero or more formal parameters. The
function name is equated to the RESULT
expression, which involves names of defined
functions, names of atomic operators, formal
parameters, and imported values. Abbreviations
of multiply-used values are provided by LET
expressions, which are also equations equating
the left-hand side identifier of a BE to the
right-hand side expression. The latter
expression may involve the identifier on its own
left-hand side, as can the function being defined
involve itself. Finally, an IMPORTS declaration
allows values defined externally to a function
definition to be used inside the definition.




Algol-like lexical scoping is used, except that
imported values are declared implicitly.


When a value defined in a LET.... BE.... involves
itself, or when a function f defined in terms of
a formal variable x involves the expression f(x),
or when a value is defined in terms of an
expression which involves the importation of the
value itself, we say that there is an
"applicative loop". Such loops permit
implementation of data structures in terms of
themselves, thereby providing for the generation
of infinite data structures without either the
obvious infinite recursion or use of side-effect
operators such as Lisp's rplaca. The latter
often have the effect of destroying local
determinacy, a property useful in verifying
concurrent programs.




As an example of a textual representation of an 
FGL program, consider the following: 

FUNCTION oddprimes( limit) 

LET primes be 


a ern = 
‘ prime 
   RESULT primes

   WHERE

      FUNCTION primesfrom(n)
         IMPORTS (primes, limit)
         LET rest BE primesfrom(n+2)
         RESULT if n > limit
                then nil()
                else if relprime(primes)
                     then cons(n, rest)
                     else rest

         WHERE

            FUNCTION relprime(stream)
               IMPORTS n
               LET first BE car(stream)
               RESULT (square(first) > n)
                      or ((not divides(first, n))
                          and relprime(cdr(stream)))

The program above generates the list of prime
numbers beginning with 3 and not exceeding the
value of the argument limit. It does so by
forming a sequence of numbers, a number being
included in the sequence only if it is prime.
The primality of the number is tested by using
lesser members in the sequence as trial divisors.

Figure 1: FGL graph of the Odd-Primes Example.
An applicative loop exists, in that primesfrom is
used to define the sequence primes, but also uses
that sequence as an imported value in its
definition. The or above is a sequential
function, in that it only demands arguments in
sequence as they are needed to determine the
value.
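
The applicative loop can be imitated in a conventional language with explicitly delayed tails. The following Python sketch is an illustration, not the AMPS evaluator: the `Cons` class, the `to_list` helper, and the seeding of the stream as cons(3, primesfrom(5)) are assumptions of this example. The non-strict cons of FGL is modelled by storing the tail as a thunk that is forced at most once.

```python
class Cons:
    """A stream cell whose tail is computed on first demand and then cached."""

    def __init__(self, head, tail_thunk):
        self.head = head
        self._thunk = tail_thunk
        self._tail = None
        self._forced = False

    @property
    def tail(self):
        if not self._forced:
            self._tail = self._thunk()
            self._thunk = None
            self._forced = True
        return self._tail


def oddprimes(limit):
    def relprime(stream, n):
        # Sequential "or": stop as soon as the answer is determined.
        while True:
            first = stream.head
            if first * first > n:
                return True
            if n % first == 0:
                return False
            stream = stream.tail

    def primesfrom(n):
        # Skip composites iteratively; delay the rest of the stream.
        while n <= limit:
            if relprime(primes, n):
                return Cons(n, lambda n=n: primesfrom(n + 2))
            n += 2
        return None  # plays the role of nil()

    # Assumed seed: the stream refers to itself through primesfrom.
    primes = Cons(3, lambda: primesfrom(5))
    return primes


def to_list(stream):
    out = []
    while stream is not None:
        out.append(stream.head)
        stream = stream.tail
    return out
```

Calling `to_list(oddprimes(30))` walks the stream on demand; each candidate is tested only against primes that have already been produced, which is why the self-reference does not recurse without bound.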


An expression in FGL is formally represented as a 
directed graph, with the nodes being identified 
with the operators in the expression. We think 
of each arc in the graph as being a carrier for 
an FGL data object. A node defines an
input/output functional relationship between the
ultimate values on the arcs directed into the
node and the ultimate value on the arc directed
out. (We assume that each node has a single
outgoing arc for simplicity.) In the graphical 
form of FGL, each functional equation may be 
represented by a graph grammar production in 
which the antecedent names the function being 
defined, and the consequent presents the graph of 
the defining expression. 


The graphical form of the preceding program is
shown in Figure 1. The applicative loop which
results from the compilation of the textual FGL
program is evident there. Although in this
figure we represent imported values by direct
connections into the consequents of productions,
accurate treatment of scoping rules demands that
productions involving imports be replaced with
the concept of an enveloped graph, which may
eventually be presented as an argument to the
apply function [16]. To simplify the discussion,
we shall not consider this treatment here.


Certain atomic functions are provided, such as
the following:

add, and, divides, mult, etc., which have the
obvious interpretation.

cons groups its arguments into a tuple, even if
the arguments are not completely known at time of
application. That is,

    cons(x_1, x_2, ..., x_n) = (x_1, x_2, ..., x_n)

where the right hand tuple exists independent of
what the x's might be.

select is defined by

    select(i, (x_1, x_2, ..., x_n)) = x_i

provided i ≠ ?. It is undefined if i = ?, but
there is no requirement that x_j ≠ ?, when i ≠ ?,
for any j.

car and cdr are defined by

    car(x_1, x_2, ..., x_n) = x_1


    cdr(x_1, x_2, ..., x_n) = x_n

which is consistent with the Lisp definition when
n = 2.


cond is the name of the conditional function
(i.e. "if.... then.... else....").

nil, returns the value NIL.

null, tests for the atom NIL.
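The leniency of cons and select can be illustrated with a small Python sketch. It is not the FGL implementation: representing each tuple component as a zero-argument function (a thunk), the Undefined exception standing for the value ?, and all names below are assumptions made for the illustration.

```python
class Undefined(Exception):
    """Raised when a component standing for '?' is actually needed."""


def cons(*thunks):
    # Lenient cons: build the tuple without evaluating any component.
    return tuple(thunks)


def select(i, tup):
    # 1-based selection; only the chosen component is forced.
    return tup[i - 1]()


def car(tup):
    return select(1, tup)


def undefined():
    raise Undefined("?")


# The tuple exists even though its second component is undefined.
pair = cons(lambda: 42, undefined)
first = car(pair)
```

Selecting the first component succeeds even though the second is undefined; only an attempt to select the second component raises Undefined.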


Additionally, there are "pseudo-functions", such
as print, which has the side-effect of printing
its argument on some external device, and read,
which has the side-effect of reading an external
device to determine its result. The use of such
functions can be completely avoided outside of
utility routines provided for input and output.

Additional auxiliary functions are provided for
extra evaluation control. Examples are seq,
which causes its arguments to be evaluated in
sequence, and par, which causes its arguments to
be evaluated concurrently. (Strict functions
such as add, mult, etc. also have the latter
effect.)


Through the use of pre-compilation and removal of
certain recursions and common subexpressions, our
evaluator incurs no combinatorial explosion of
the type which would normally occur in circular
recursive evaluation of applicative loops. All
theorems proved in [10] also hold for the FGL
evaluator. However, the fact that we compile
applicative loops without additional recursions
provides a feature for yielding terminating
executions for evaluations which would be
non-terminating in other systems. For example,
we can state

Theorem: The FGL evaluator terminates on some
programs for which the evaluator of [10] fails
to terminate.


To prove this, consider the program (which would
differ syntactically when presented to the
Friedman and Wise evaluator):

FUNCTION main
RESULT print f(0)
WHERE

FUNCTION f(x)
RESULT car f(x)

The Friedman and Wise evaluator would recurse
infinitely, generating

print(car(car(car(car(...)))))

The FGL evaluator stops (without printing
anything) when it dynamically and implicitly
"discovers" that f(x) is trying to compute a
strict function of itself.
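The way such an evaluation can stop is illustrated by the following Python sketch of a demand-driven evaluator with a FIFO task list. It is a simplified model, not the FGL or ML evaluator: the Node class, the run function, and the modelling of f(x) as a single node whose argument is itself (the compiled applicative loop) are all assumptions made for the illustration.

```python
from collections import deque


class Node:
    def __init__(self, name, op=None, args=(), value=None):
        self.name, self.op, self.args = name, op, list(args)
        self.value = value          # None means "not yet computed"
        self.demanded = False
        self.notifiers = []         # nodes waiting for this node's value


def run(root):
    """Drain a FIFO task list; return the number of tasks processed."""
    root.demanded = True
    tasks, steps = deque([root]), 0
    while tasks:
        node = tasks.popleft()
        steps += 1
        if node.value is not None:
            continue
        pending = [a for a in node.args if a.value is None]
        if pending:
            for a in pending:
                if node not in a.notifiers:
                    a.notifiers.append(node)
                if not a.demanded:      # each node is demanded at most once
                    a.demanded = True
                    tasks.append(a)
            continue
        node.value = node.op(*[a.value for a in node.args])
        tasks.extend(node.notifiers)    # wake the nodes waiting on this value
        node.notifiers = []
    return steps


# f(x) = car f(x): the node for f depends strictly on itself.
f = Node("f", op=lambda t: t[0])
f.args = [f]
root = Node("print", op=lambda v: v, args=[f])
steps = run(root)   # the task list empties; root.value remains None
```

Because a node that has already been demanded is never placed on the task list a second time, the self-dependent node simply waits on itself, the task list empties, and nothing is printed.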


We do not present the fixed point semantics of 
FGL here, instead referring the reader to [16]. 
However, we give a brief intuitive description of 
these semantics. For a directed acyclic function 
graph, the meaning can be understood simply from 
the definitions of the functions assigned to each 
node. That is, the output of each node is the 
function prescribed for the node applied to the 
input values of that node. Note that this makes 
sense even if the graph is infinite, so long as 
each path from each of the graph's inputs to its 
output is finite. 


In FGL, the program representations are always
finite, but these representations can be
understood by (but are not implemented by)
expanding the representations into acyclic graphs
which are sometimes infinite. Namely,

1. Each node having a function prescribed by a
production is effectively the same as
replacing that node with the consequent of
the production.

2. Each cycle in the graph can be "unwound" by
repeated "node-splitting" to obtain an
equivalent infinite acyclic graph.

The validity of this means of understanding
depends on the fact that all FGL functions are
"continuous" over an appropriate Scott data-type
ordering. Although this fact is used later,
space does not permit further elaboration of its
meaning, and the essential ideas may be
understood without it. The reader may refer to
[16] for further explanation.

Space limitations also preclude further
definition of "node-splitting", but the idea is
reasonably intuitive. Further discussion may be
found in [16]. We henceforth understand by the
dag form the acyclic graph as determined above.
The above description is equivalent to the
"least fixed point" semantics of FGL programs,
which is also equivalent to the viewpoint of the
program as a system of equations. It also points
out the determinacy of FGL programs, i.e. that
each program represents a unique function.


The diagram of Figure 2 illustrates the scheme of
evaluation in the odd primes example of Figure 1.
It shows the loop formed by using the sequence of
primes being generated to assist in their own
further generation, as well as concurrent
evaluation of primesfrom for different arguments.
The dag form resulting from unwinding the cycle
is shown in Figure 3.

Implicitly included in an evaluation such as the
one above is an arbitrary number of
"producer-consumer" relationships which the
evaluator must implement so that needed values
are produced and used consistently, independent
of system-wide interleaving. These evaluations
could be distributed among processing elements to
heighten concurrency and thereby reduce computing
time. The arbitrary fan-out of values, alluded
to earlier, is quite apparent in the diagram.


It should be noted that the definition of FGL
semantics is embodied in the language, not the
evaluator. That is, its semantics are given
denotationally, by specifying the semantics of
each of the atomic functions. This is why we
prefer to use the term "lenient cons" instead of
saying that we have a "lazy evaluator" [11]. For
a denotationally-defined language, an evaluator
is either correct or is not. Similarly, if one
wishes a cons to have a different effect, this
amounts to a redefinition of cons, not a change
in the evaluator. We happen to prefer the
lenient version of cons as a standard, but our
results in no way rely on the presence of this
operator. We can also include other forms of
cons (with different names, of course). The main
reason lenient cons is of interest here is
because it is the source of a need for "forward
chaining", to be discussed later.

Figure 2: Expansion of the Odd-Primes Example.


A single equation, which defines the "top level" 
function main, acts to drive the others, its 
value being demanded externally by the system. 
In a sense, it is the goal of the evaluator to 
produce the "value" of main. For example, we 
might include the definition of oddprimes above 
in the following program, which reads a number, 
then prints all odd primes not greater than that 
number. 


FUNCTION main
RESULT printall(oddprimes(read())) 
WHERE 


FUNCTION printall(x) 
RESULT if null x 
then nil() 
else seq(print car x, 
printall cdr x) 


The program above, when run on our evaluator,
will not produce a particularly high degree of
concurrency. However, it is a simple matter to
enhance its concurrency with the special
operator, par, which is functionally transparent
(it is the identity function on its first
argument), but which has the effect of
introducing additional demands for values. In
the present example, we need only modify the
definition of primesfrom, obtaining the
following:


FUNCTION primesfrom(n)
IMPORTS (primes, limit)
LET rest BE primesfrom(n+2)
RESULT if n > limit
       then nil()
       else par(
              if relprime(n, primes)
              then cons(n, rest)
              else rest,
              rest
            )
In this example, the sub-expression
primesfrom(n+2) is demanded concurrently with the
testing of relprime(n, primes), so that the
latter does not cause the generation of the
sequence of primes to be sequentialized. Since
common sub-expressions are identified as the same
value, the same value of primesfrom(n+2) will be
used in evaluating the if.... then.... else....
No recomputation will take place.


TARGET MACHINE LANGUAGE 


As mentioned previously, our ultimate motivation
for the FGL evaluator is its realization on the
highly parallel machine architecture AMPS. While
the physical details of such a machine are not
relevant here, its language ML and execution
semantics are. Hence we include here a brief
sketch of these aspects.

The machine consists of a large number of
identical processing elements (PEs), each
possessing a portion of a uniformly-addressed,
but physically distributed, memory. The
fundamental observable action in a PE is a task,
involving bounded space and time behavior, such
as the execution of an atomic function or the
propagation of a value instance or a demand.


Parallelism is achieved by exporting, to
neighboring processors, function application
tasks which have been spawned by strict
operators. Unlike the proposal of [7], no
"sergeant" tasks are generated for computations
which might not be required. However, the
programmer may include functions, such as par in
the preceding example, which cause such tasks to
be generated. Further aspects of resource
control in FGL are discussed in [17].

Unlike FGL, not every interconnection of ML
operators is a valid program. For example, it is
possible to construct incorrect linkages.
However, the compiler ensures that only valid ML
programs are generated from their FGL inputs. We
have insufficient space to include a presentation
of what is or is not valid in ML.

Figure 3: Dag form of the example in Figure 2.


Each programmer-defined function is represented
(in pure code) as a block encoding of its graph.
The code inside a block has roughly one word
corresponding to each node. A typical code word
contains the name of the node's operator, local
(relative) addresses representing the node's
arguments, and space for local notifiers
(addresses used to tell which other nodes are to
be informed when the node's value is ready).

The action corresponding to application of an FGL
production is triggered as each instance of the
antecedent is demanded. This action entails the
allocation of a block into which the encoded
graph is copied, and the linking of the arguments
and imports of that block with the block
containing the antecedent, in effect splicing the
graph represented by the code in place of the
antecedent itself.

Evaluation of a node entails overlaying the node
with its result. Of course, its notifiers are
first temporarily saved by the processor, as they
occupy some of the space required by the result
itself. Here we see a contrast in that FGL
values are viewed as appearing on the arcs,
whereas ML values appear as transformed nodes. A
more important contrast is that FGL objects can
be infinite, whereas ML objects must each fit
into bounded space.


With these considerations in mind, the FGL model
must be refined toward the target machine
representation so that fixed word and block sizes
are possible. In particular:

1. Because of their disparity in size and
meaning, local addresses cannot be freely
converted to global addresses and
vice-versa. Instead, special operators are
provided at compile time to interface from
one block to another.

2. While arcs within a block have statically
bounded fan-out, global arcs can experience
unbounded fan-out (e.g. due to multiple
remote demands on a given tuple component's
value).

3. In distributing values according to (2), the
ML evaluator should not create new nodes
(words) to mediate dynamic fan-out, lest
storage management become more complicated.

4. The ML evaluator evaluates tasks using a
task list which is generally distributed
over the available processing elements.
This list is used to determine an ordering
of the application of transitions. Not all
properties of the ordering are important.
It only matters that once a transition rule
is eligible for application, it does
eventually get applied. This effect is
achieved by FIFO queuing in ML, and
finite-delay is the corresponding property
in IGL.

As remarked above, ML code blocks use small
relative addresses to express the local
connectivity within a function graph. Global
addresses are used to represent objects
referenceable across code block boundaries.
These include references to function definitions
(pure code), function closures, function
applications (for passage of parameters, globals,
and result), and tuple values. In ML execution
diagrams, e.g. Figure 7, global addresses will be
represented by arcs with hyphenated lines.


The principal operators involving global
addresses are forward and fetch. The operator
forward connects a local argument (e.g. a
function result value) to a global demander (e.g.
its place of application). The operator fetch
does the complementary action. It may be noted
in each step that global address arcs only
emanate from forward nodes, and that no new nodes
are created in any step. Thus the creation of
global pointers and the use of existing code
space is well-disciplined.

Figure 4: Example of execution in ML and the
corresponding IGL transition. [Panels: task list
a with block contents before the transition;
task list b, c with block contents after.]


INTERMEDIATE GRAPHICAL LANGUAGE 


In attempting to prove that ML is a valid 
implementation of FGL, the disparity between the
two languages seems best approached by the 
introduction of a third graphical language, IGL. 


The data objects of IGL are close to those of 
FGL, except that they use references, whereas FGL 
avoids references in favor of objects with more 
mathematical elegance. 


The IGL objects are:

1. atoms, as in FGL.

2. ?, the undefined value

3. error, the error value

4. references, of one of three kinds:

a. tuprefs, references to tuples

b. coderefs, references to master copies of
code blocks

c. funrefs, references to function closures
(i.e. pairs consisting of a coderef and
a tuple of imported values)

5. tuples of IGL objects of the above types
only. (Tuples with tuples as components are
not allowed. These must be provided by
references.)

Unlike FGL, IGL objects must be finite. There
are no limit objects. Instead, limit objects are
implicitly represented by fixed points of
equations, as will be described presently.

ML and IGL have the same data objects in common,
but ML is more restricted in the way it can
handle those objects, and includes the special
linkage operators mentioned in the previous
section. Another common characteristic between
ML and IGL is that both are viewed as replacing
the operator nodes with a value, whereas FGL is
viewed as producing a value on an arc. Hence, we
introduce the intermediate language to provide a
convenient link between a very mathematical
language on the one hand and a very pragmatic
language on the other. Table I summarizes the
differences between FGL, IGL, and ML. Like FGL,
each arc in an IGL graph determines (again, by
fixed point semantics) a data object. However,
we need to progress toward the operationally
defined ML. Hence we must at this point give an
alternate, operational, definition of IGL which
relates to its denotational definition in an
obvious way. Accordingly, we choose to think of
the nodes of an IGL graph as having values, which
are identifiable as the same values determined on
their (single) output arcs. Operationally, an
IGL node will ultimately be replaced with that
value if there is a demand for it. Another way
of viewing this replacement is that the function
in the IGL node is changed to a constant function
having that value.

                  FGL              IGL                ML

Values manifest   on arcs          on arcs, or        replacing nodes
                                   replacing nodes

Infinite values   allowed          not allowed        not allowed

Fan-out           arbitrary        bounded            bounded

Linkages          implicit in      tuples and         tuples and selectors
                  productions      selectors          converted to
                                                      fetches and forwards

Table I: Comparison of the three language levels

The precise operational behavior of our IGL
evaluator, as well as its correctness with
respect to the denotational semantics, will be
approached in terms of "marked IGL graphs", which
reflect demand and data flow in a manner similar
to ML.

A marked IGL graph is an IGL graph in which each
node is either marked *, for demanded, or
unmarked. Marked IGL graphs are the states of an
abstract state transition system (cf. [12]) which
models the flow of demand and values among nodes.
The transitions in this system are based on
transition rules for each of the node operators,
as determined by the type of these operators.

We list in Figure 5 some of the rules in terms of
markings. An evaluator becomes completely
specified when the transition rules are
accompanied by a specific order for their
application. However, in a distributed system,
this order will be difficult to control. Thus,
instead of giving a rigid order, we assume for
IGL only a finite-delay property: A rule cannot
remain applicable forever without being applied
by the evaluator. This property is ensured by
the ML realization, as will be later sketched.

Figure 5: Transitions between marked IGL graphs.


IGL TO FGL MAPPING 


The use of IGL as a conceptual "implementation"
of FGL is achieved through the mathematical
device of a mapping from the data values and
operators of IGL to those of FGL. As mentioned
previously, the main distinction to be drawn
between FGL and IGL lies in the data types.
Whereas the FGL data types are based purely on
mathematical structure, IGL introduces objects
which refer to parts of the graph to aid in the
progression toward ML.

Another distinction between FGL and IGL of a more
technical nature is that the arguments and
imports to function objects in FGL are achieved
simply by splicing the appropriate arcs together.
In IGL, this effect is created by packaging into
separate tuples the arguments and imports. These
tuples reside in the applying block and the
environment block, respectively. Selectors are
used inside the applied block which accesses the
tuples.

We have already discussed how a unique FGL object
is determined on each arc of an FGL program,
given that each of its input arcs has been
assigned a value. In the context of such an input
assignment, if x is an arc, then we denote the
determined value by Fval(x). In a similar way, a
unique IGL object is determined on each arc of an
IGL program, and we denote this value by Ival(x).


The IGL program graph gives rise to a system of
FGL equations whose least fixed point defines,
for each IGL object x, a corresponding FGL object
h(x) as follows:

1. If x is ?, error, or an atom, then h(x) = x.

2. If x is a tupref, referring to
(x_1, x_2, ..., x_n),
then h(x) = cons(h(x_1), h(x_2), ..., h(x_n)).

3. If x is a funref, then h(x) is the function
graph referenced by x, together with bound
import arcs as determined by the tuple part
of the referenced closure.
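The first two clauses of the definition of h can be sketched in Python. This is an illustration under stated assumptions, not the paper's construction: the TupRef class, the store dictionary mapping addresses to tuples, and the depth bound are all hypothetical, and funrefs (clause 3) are omitted. Since cyclic references denote infinite FGL objects, the sketch computes only a finite approximation, cutting off with ? when the depth is exhausted.

```python
from dataclasses import dataclass

UNDEFINED = "?"
ERROR = "error"


@dataclass(frozen=True)
class TupRef:
    address: int   # index into a hypothetical tuple store


def h(x, store, depth):
    """Map an IGL object to a finite approximation of its FGL value.

    Atoms, '?' and error map to themselves; a tupref maps to the tuple
    of its mapped components. Expansion stops at the given depth and
    yields '?' there, because a cyclic reference has no finite image.
    """
    if isinstance(x, TupRef):
        if depth == 0:
            return UNDEFINED
        return tuple(h(c, store, depth - 1) for c in store[x.address])
    return x


# A finite structure and a self-referential one.
finite = {0: (1, TupRef(1)), 1: (2, ERROR)}
cyclic = {0: (1, TupRef(0))}
```

For the finite store, h expands the nested reference completely; for the cyclic store, each extra unit of depth yields one more level of the infinite tuple.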


Each arc of the FGL graph can be identified as a
unique arc of the IGL graph. Since IGL has
additional operators for linkage, the converse is
not true.

The link between the partial correctness of IGL
and that of FGL may now be stated in terms of an
equation involving the mapping h.

IGL-->FGL mapping theorem: For any arc x of an
FGL graph,

    Fval(x) = h(Ival(x))

To prove this theorem, we need only observe that
h is a homomorphism from the space of IGL
functions and domain to the corresponding FGL
space. Here we may rely on the dag forms of the
corresponding IGL and FGL programs. The
technique is essentially that explained by [18].
[11] presents a similar theorem, stated in terms
of a "semantic memory" instead of FGL arc values.


Since it is generally meaningless to speak of an 
evaluator producing a full FGL object, we phrase 
our definition of evaluator correctness in terms 
of IGL objects, as follows: 


IGL Partial Correctness Theorem: If q is a
state and x an arc marked demanded in q, and
Ival(x) ≠ ?, then the IGL evaluator can reach a
state q' such that x is marked with its
corresponding IGL value.

To justify this theorem, we identify node x as
the node having x as its output arc. Consider
the corresponding dag structure of the IGL graph
with root node x, assuming now that Ival(x) ≠ ?.
Then either node x is a constant function having
value Ival(x), or x produces Ival(x) based on the
values of its arguments. In the first case, one
transition rule gives us the desired result. In
the second case, the inductive assumption is that
the arguments evaluate appropriately so that
evaluating the function in node x gives the
desired result. Thus, the inductive conclusion
tells us that these arguments can be produced by
application of the transition rules. Therefore
application of one or more transition rules for
the root node will produce Ival(x).


The above use of induction is technically
justified from the continuity of IGL operators.
Informally, this says that a finite value (e.g.
any IGL value) producible from an arbitrary
composition of operators is also producible from
a finite truncation of that composition. For a
further discussion of such uses of continuity,
see [19], [20], or [16].

Given this partial correctness, we have the
corollary that any finite piece of an IGL value
can be produced by an appropriate set of demands.
Simply affix to the arc in question a
supplementary function graph of selectors which
evaluate to that piece formally, then apply the
above criterion to the output of the
supplementary graph.

By assuming that the underlying IGL evaluator has
the finite-delay property, the "can" above
effectively becomes "will". This property is
provided in the definition of ML. This approach
is necessary since there is no mechanism for
ensuring the finite-delay property within IGL
itself.


In the next section, we appeal to ML to provide 
the necessary infrastructure for total 
correctness of the IGL evaluator. 


ML TO IGL MAPPING 


As stated earlier, ML and IGL have the same data
objects. As the corresponding ML-->IGL mapping
is rather trivial, involving only replacement of
linkage operators by identities, it will not be
elaborated upon here. Furthermore, both IGL and
ML are evaluated by changing the operations of
their nodes into values. In ML however, we
provide an implementation of demand/value
propagation symbolized by markings in IGL.

In IGL, the presence of a demand for a node's
value is indicated by marking the node with an
asterisk. It would be infeasible, in ML, to
search the memory for all demanded nodes each
time a new value is computed. Instead, ML
employs a task list structure which contains
pointers to all nodes on the wavefront of
demand/value propagation (see Figure 6). The
wavefront may be thought of as initially
propagating in the direction opposite to the
argument arrows and being reflected in the
opposite direction when computed values are
encountered.

Figure 6: Wavefront of demand/value propagation.
Nodes a and b are currently on the task list.
a will be evaluated and notify c; b will propagate
its demand to d and e.

For demanded nodes not on this wavefront, the
fact that the node has been demanded is recorded
by the presence of a notifier or a forward
pointer (see next section) in some other demanded
node. Thus, consider the following definition of
a set of nodes S:

1. Nodes on the task list which do not yet have
a value are in S. (a)

2. If x is a node in S, and x contains a
notifier or forward pointer to node y, then
y is in S.

3. All nodes in S are there because of one of
the above reasons.

Wavefront Lemma: S consists of exactly the
demanded nodes.

The proof of the above lemma is by transition
induction (cf. [12]) on the ML transition rules.
All initially demanded ML nodes are externally
placed on the task list. A case analysis of the
ML transition rules reveals that any newly
demanded node is put in S. Similarly, any node
which is replaced with its value cannot remain in
S, but nodes requiring that value are put in S.

We repeat that the finite-delay property for IGL
means that every demanded node in a given state,
if entitled to eventually receive a value
(because the IGL output value of that node is not
?), will receive a value. As is well known, FIFO
processing of nodes in a directed graph gives
rise to breadth-first visitation of the nodes,
i.e. the wavefront effect. By processing the
task list in FIFO order, it is clear that any
node in need of attention eventually receives
that attention. In particular, every node gets
attention when it is first demanded, and when it
is able to compute its value.

In the proposed AMPS architecture, the task list
is not monolithic, but instead is distributed
among many processing elements. However, each of
the segments is processed in FIFO order, so the
same wavefront effect is obtained.


PRAGMATIC ASPECTS


Although not required for correctness as stated,
parsimonious evaluation is also achieved. That
is, each node is evaluated at most once, since
the presence of a notifier inhibits potential
secondary demand propagation. This idea, applied
to the cons operator, was called "suicidal
suspension" in [10]. It has also been used in
operating systems (e.g. the dynamic linking
mechanism of Multics) for some time. Our
evaluator includes this technique for all
operators.
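The set S used in the Wavefront Lemma can be computed directly from its three-clause definition as a closure over notifier and forward pointers. The Python sketch below is illustrative only: the function name demanded_set and the representation of the machine state (a list of node names, a dictionary recording which nodes have values, and a dictionary of outgoing notifier or forward pointers) are assumptions, not part of ML.

```python
from collections import deque


def demanded_set(task_list, has_value, pointers):
    """Return S: unvalued task-list nodes plus everything reachable
    from them through notifiers or forward pointers."""
    # Clause 1: task-list nodes that do not yet have a value.
    s = {n for n in task_list if not has_value.get(n, False)}
    frontier = deque(s)
    # Clause 2: follow notifier / forward pointers out of members of S.
    while frontier:
        x = frontier.popleft()
        for y in pointers.get(x, ()):
            if y not in s:
                s.add(y)
                frontier.append(y)
    # Clause 3: nothing else is in S.
    return s


# The situation of Figure 6: a and b are on the task list,
# and a holds a notifier for c.
s = demanded_set(["a", "b"], {}, {"a": ["c"]})
```

In the Figure 6 situation this yields a, b and c; nodes d and e join S only after b has propagated its demand to them.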


ML includes additional operators apart from IGL,
namely the special operators used to control data
flow across block boundaries. Specifically,
whenever a selector in one block refers to a
tuple in another, the selector is replaced with
the special fetch operator which matches a
forward operator in the tuple component. The
fetch operator contains the global address of the
forward. A demand of the fetch (which occurs
automatically when the selector is demanded) is
then propagated to the forward, which propagates
the demand to another operator local to its
block. At the same time, a forward pointer back
to the fetch is stored in the forward, so
that when the demand is satisfied, the forward
will know where to send the result. A
fetch/forward pair is also used to pass the
result of the block to its destination.

(a) Because of some redundancy in the evaluator,
a node could be on the task list and have a
value. For example, it could be notified by two
different nodes, and become evaluated before the
second notification "takes effect".


A possible alternative to forward chaining is to
use "busy waiting". That is, the second and
subsequent fetches for the same value are simply
re-cycled back to the task list to be re-tried
again and again. This solution is viewed as
unacceptable, as the wait can be arbitrarily
long.

As described thus far, the ML fetch/forward pairs
resemble identity functions which carry out the
linkage needed to implement an arc crossing
blocks in IGL. However, a complication arises
when there is more than one demand on the same
component of a tuple. This complication was not
mentioned in [10] where it does not occur because
evaluation is sequential, but neither was it
mentioned in [7]. The property asserted there of
the existence of at most one reference to any
"suspension" seems infeasible for a parallel
evaluator, as we now discuss.


Since the number of demands may, in principle, be 
arbitrary, there is no fixed word size which can 
accommodate sufficiently many forward pointers. 
Hence a scheme called forward chaining is used. 
This scheme maintains the invariant (provable by 
transition induction) that at most one forward 
pointer is ever stored in a given forward node. 
This is accomplished by having each additional
fetch to the same forward operator assume the
responsibility for forwarding to the location to
which the forward pointer pointed, while the
forward operator then points to the most recent 
fetch only. The handling of fetch and forward in 
ML is demonstrated in Figure 7. 
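The forward chaining scheme just described can be sketched in Python. This is a simplified model under stated assumptions, not ML code: the Fetch and Forward classes, the chain field, and the deliver method are hypothetical names, and delivery is shown as a single loop rather than as separate tasks on a distributed task list.

```python
class Fetch:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.chain = None      # earlier fetch this one must forward to


class Forward:
    def __init__(self):
        self.pointer = None    # invariant: at most one forward pointer

    def demand(self, fetch):
        # The new fetch takes over the old pointer; the forward node
        # keeps a pointer to the most recent fetch only.
        fetch.chain = self.pointer
        self.pointer = fetch

    def deliver(self, value):
        # Send the value to the most recent fetch, which relays it
        # along the chain to every earlier demander.
        order = []
        fetch = self.pointer
        self.pointer = None
        while fetch is not None:
            fetch.value = value
            order.append(fetch.name)
            nxt = fetch.chain
            fetch.chain = None
            fetch = nxt
        return order


fw = Forward()
a, b, c = Fetch("a"), Fetch("b"), Fetch("c")
for f in (a, b, c):
    fw.demand(f)
order = fw.deliver(9)   # value reaches c first, then b, then a
```

However many fetches demand the same component, the forward node stores a single pointer, and each fetch stores at most one, so fixed word sizes suffice.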


Although there is no limit on the number of 
(local) notifiers a node may entail, the number
actually needed in each case can be detected at 
compile time. Hence it is possible for the 
compiler to cascade extra identity operators in 
such a way that the number of notifiers for each 
node does not exceed the maximum pragmatically 
allowed. 
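One way a compiler might cascade identity operators is sketched below in Python. The grouping strategy is an assumption made for illustration; the text does not specify how the cascade is built. Each inner list stands for an inserted identity node, and the function name cascade and the parameter max_notifiers are hypothetical.

```python
def cascade(consumers, max_notifiers):
    """Group consumers so that no node notifies more than max_notifiers
    others; each inner list stands for an inserted identity node."""
    if max_notifiers < 2:
        raise ValueError("need at least two notifiers per node")
    level = list(consumers)
    while len(level) > max_notifiers:
        # Replace each run of up to max_notifiers entries by one identity node.
        level = [level[i:i + max_notifiers]
                 for i in range(0, len(level), max_notifiers)]
    return level


tree = cascade(list(range(5)), 2)
```

With five consumers and a limit of two notifiers, the producer notifies two identity nodes, which fan the value out through a further level of identity nodes.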


CONCLUSIONS


We have described some considerations which arise
in the evaluation of an applicative language in a
manner capable of exploiting a multiplicity of
physical processing elements. The present
exposition focuses on the analysis of a hardware
evaluator for the AMPS system. In addition to
the graphically-represented extrinsic language
and machine language, an intermediate graphical
language has been introduced, to separate
questions of value flow from more pragmatic
issues of communication and demand flow.

The important aspects of this work thus concern
the distributed evaluator itself, the analysis
techniques, the graphical models, the
formalization of demand-driven computation and
accompanying correctness criterion, and further
technical exposition of machine evaluation of
unbounded data objects.

We view this analysis as a step toward a proof
for a fuller system in which a reference-counting
storage manager is implemented (cf. [21]), as
well as other language and pragmatic issues, such
as shared resource management and load control
[17].


Figure 7: Example of forward chaining in ML. 
Hyphenated arcs denoted global addresses. 
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SPECIFICATION AND SYNTHESIS OF SYNCHRONIZERS 
Abstract -- Presented is a specification
language for expressing properties required among
operations accessing shared resources in a
concurrent environment. Such constraints are
necessary in order to maintain the integrity of
resources. The language is founded on Temporal
Logic and possesses constructs for expressing, in
a natural manner, properties such as mutual
exclusion of operation execution, priority among
operations, invariance of resource state, and
scheduling disciplines. Each of the above
properties is expressed independently of the
others, resulting in modular specifications. An
algorithm is outlined for systematically
synthesizing code for a synchronizer from the
given specifications. Synthesis is achieved by
successive transformation of the specifications
into target language code. Feasibility of the
specification and synthesis technique is
demonstrated by applying it to a standard
synchronization problem.


INTRODUCTION


Two main approaches exist for the development of
any provably-correct software system. The first
involves construction of programs, followed by a
posteriori verification that the program meets
intended specifications. In this case, the
specifications themselves often play a
descriptive role rather than a prescriptive one,
since there are many different sets of
specifications which can be met by a given
program. The second approach involves automatic
synthesis of a program directly from the
specification. Here the specifications must be
sufficiently prescriptive to enable a synthesis
to be carried out.


This material is based upon work supported
by the National Science Foundation under Grant
MCS-77-09369 A01.

The advantages of the synthetic approach are
therefore that the tedious task of a posteriori
verification is eliminated and the specification
is required to be sufficiently free of ambiguity.
The disadvantages are that synthesis algorithms
are difficult to devise, such algorithms
themselves must be verified (but this is a
one-time cost), and the code produced by a
synthesis algorithm is sometimes less efficient
than desired.


In this paper, we suggest some principles for
construction of a specification language and an
accompanying automatic program synthesis system
for synchronizer code. A system of this type
would accept specifications that characterize the
synchronization problem to be solved and
generate a program that conforms to the problem
description. The solution proposed uses temporal
logic as the basis for the semantics of the
synthesis system. Our approach consists of:

1) Designing a rich class of primitives and
constructs for a high-level language in which
synchronization properties can be expressed
unambiguously in a non-procedural form.

2) Devising a methodology for algorithmic
translation of specified properties into
appropriate target language code for a
synchronizer.

The temporal approach to specification and
implementation of synchronization carries with it
the advantages of a unified approach. When we
refer to ordering of operations, scheduling
discipline etc., the underlying concept is
temporal ordering. Thus it is appropriate to
adopt a system of reasoning based on temporal
logic for expressing the semantics of
synchronization of concurrent processes.

Since the reliability of programs that share
resources depends upon the correctness of the
underlying synchronizer, it is highly desirable
that the synchronizer construction be as reliable
as possible. Automating the synthesis of
synchronizers is proposed as a technique which
will aid in the development of reliable programs.


The Specification Language

The approach taken is to systematize and abstract
features of synchronization control into a set of
language constructs based on Temporal Logic, which
provides an excellent natural tool to express
both invariant and time-dependent properties of
software systems [15]. Current specification
techniques do not handle both types of properties
as uniformly as the temporal approach does. Use
of temporal constructs such as 'henceforth',
'eventually' and 'until', along with the
constructs derivable from them, results in
intuitive specifications for synchronization
problems.

The specification language satisfies the 
following criteria: 

-- It facilitates expression of the complete
semantics of a system of concurrent processes,
providing constructs for specifying constraints,
invariants and other behavioral aspects.

-- It is modular and easy to apply.

The language constructs are able to independently
express properties such as scheduling
constraints, priority of operations, mutual
exclusion of operations, invariance of resource
state, absence of starvation and other relevant
properties. Each construct has an appropriate
formal temporal semantics. Language features such
as arrays of operations and macro notation can be
used to enhance the readability and succinctness
of the specifications.


Another aspect of synchronizer behavior desired 
in the final implementation of most schemes is 
'fairness'. Our specification language provides 
for expressing a fairness criterion appropriate 
for the problem under consideration. 


The Synthesis Algorithm

Given the specification of the desired behavior
of a synchronizer of operations, the second goal
is to develop an algorithm for automatically
synthesizing a synchronizer in a prespecified
target language. The synthesis algorithm
successively transforms the specification
statements, applying appropriate
meaning-preserving transformation rules, each
step bringing the resulting statements closer to
the target language code. The transformation is
complete when the derived statements can be
mapped directly into primitives of the target
language. Also, the resulting synchronizer will
display the desired fairness. By requiring the
derived statements to retain the semantics of the
top-level specification, the resulting
synchronizer need not be verified for
correctness. Instead, only the synthesizing
algorithm need be verified.


By providing a specification language based on 
temporal logic, and a synthesis algorithm that 
guarantees the validity of the specified 
properties, this work will contribute towards 
better specification techniques, and construction 
of reliable software for concurrent systems. 
The paper continues with a presentation of the
synchronization model. The specification language
and the synthesis algorithm are then developed.
Discussion of related work precedes concluding
remarks on the proposed approach.


THE SYNCHRONIZATION MODEL 


To maintain the integrity of a shared resource, 
an answer to the question, "Who is to access the 
resource, when, and how?", is essential. A 
protection mechanism is responsible for who 
accesses the resource and how the resource is 
accessed. On the other hand, the synchronizer is 
responsible for when the access actually takes 
place. In this paper, we shall be concerned with 
the problems of synchronization. 


A synchronizer, in our model, is a centralized
sequential process that guarantees disciplined
access to shared resources. Access to the shared
resource is through specific operations, the
execution of which is controlled by the
synchronizer. Constraints essential for
maintaining the integrity of the resource are
enforced by the synchronizer. Concurrent
processes access the shared resource by
requesting execution of any of the specified
operations. A request for an operation on a
shared resource is serviced by the synchronizer
after ensuring that the constraints are not
violated. A serviced request becomes active when
it is executed by either the synchronizer or, on
its behalf, by another process.

A requested operation may be thought of as being
in one of three states:

1. Active -- Currently executing.
2. Enabled -- Can be serviced without
   infringing some constraint.
3. Disabled -- Cannot be serviced without
   infringing some constraint.

Two or more processes are said to be in conflict
if they are simultaneously enabled. Conflict
resolution occurs when the synchronizer services
one of the enabled operations, based on a
specified scheduling discipline or priority.

The model assumes that

1. Arrival of a request is synonymous with
   recognition of its presence by the
   synchronizer.
2. Once an operation is enabled, it will be
   serviced after a finite amount of time,
   unless it is meanwhile disabled by the
   servicing of some other operation, as in the
   case of conflict resolution in favor of some
   other operation.
3. There may be a finite delay between
   servicing a request and its subsequent
   activation. The synchronizer services no
   other operation until the serviced operation
   becomes active.
4. An active process cannot be aborted or
   interrupted by the synchronizer.
5. An operation remains active for a finite but
   indefinite period of time, after which it is
   said to have terminated.

These assumptions are formalized in the next 
section after the introduction of the language 
constructs. They do not introduce any major 
restrictions on the class of synchronization 
problems that can be solved, but are motivated by 
a desire to achieve a suitable abstraction of the 
notion of synchronization. Many specific 
synchronization primitives fit this abstraction. 




THE SPECIFICATION LANGUAGE 


In our language, specifications are statements in
first-order predicate calculus augmented with
temporal operators, as introduced presently. The
underlying semantics of the language is based on
a computational model involving the notion of
events and conditions [10]. In this model, the
effect of concurrent execution of processes is
considered to be the enabling and disabling of
certain conditions during the execution process.
The choice of conditions reflects those aspects
of the system of parallel processes in which we
are interested, viz. synchronized access to
shared resources. Events do not appear in the
specifications, only conditions do.


We begin with a description of the primitives 
used in the specification language. 


Language Primitives 

Every pending operation has four primitive
conditions associated with it having the
following semantics.

req(a)     There is a request for operation 'a'.
           This condition becomes True when a
           concurrent process requests operation
           'a'.

start(a)   Operation 'a' is permitted to execute
           (the permission is irrevocable). This
           condition becomes True when the
           synchronizer services request 'a'.

exec(a)    Operation 'a' is executing now. This
           condition is True when operation 'a'
           is active.

term(a)    Execution of operation 'a' has
           terminated. This condition becomes
           True when operation 'a' terminates.


We refer to each distinct type of operation on a
shared resource as an operation class. All
operations of a particular type are said to be
instances of that operation class. In the above
definitions, 'a' stands for a specific instance
of a particular operation class.


We will now introduce the temporal operators
along with their semantics. These are strongly
influenced by [12], [15].

[]C          To be read 'always C'. This means,
             condition C will remain true from
             now on, i.e., C is true now and
             throughout the future.

<>C          To be read 'eventually C'. This
             means, condition C will eventually
             become true, i.e., C will be true
             sometime in the future.

A UNTIL B    To be read as "A remains true until
             B becomes true". This means, if B
             eventually becomes true, then A
             remains true from now until B
             becomes true; otherwise []A.
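
The three operators can be illustrated on a finite trace of snapshots. The following is only a sketch under stated assumptions: the paper's semantics is over unbounded futures, whereas here the "future" is a finite Python list, 'now' is included when evaluating <>C, and the names always, eventually, until and TRACE are hypothetical, not part of the specification language.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, bool]            # one snapshot: condition name -> truth value
Cond = Callable[[State], bool]     # a condition evaluated on a snapshot


def always(trace: Sequence[State], c: Cond, now: int = 0) -> bool:
    """[]C on a finite trace: C holds now and at every later snapshot."""
    return all(c(s) for s in trace[now:])


def eventually(trace: Sequence[State], c: Cond, now: int = 0) -> bool:
    """<>C on a finite trace: C holds now or at some later snapshot
    (including 'now' is an assumption of this sketch)."""
    return any(c(s) for s in trace[now:])


def until(trace: Sequence[State], a: Cond, b: Cond, now: int = 0) -> bool:
    """A UNTIL B: A holds at every snapshot before the first one where B
    holds; if B never holds, A must hold throughout (i.e. []A)."""
    for s in trace[now:]:
        if b(s):
            return True
        if not a(s):
            return False
    return True


# Hypothetical trace of one operation: requested, still requested, executing.
TRACE = [
    {"req": True, "exec": False},
    {"req": True, "exec": False},
    {"req": False, "exec": True},
]


def req(s: State) -> bool:
    return s["req"]


def exe(s: State) -> bool:
    return s["exec"]


print(until(TRACE, req, exe))      # req(a) UNTIL exec(a)
print(always(TRACE, req))          # []req(a)
print(eventually(TRACE, exe))      # <>exec(a)
```

On this trace the request persists until execution begins, so the UNTIL check succeeds even though []req(a) does not.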


Statements that do not involve the temporal
operators are considered to be about the present,
or 'now'. In general, statements in the language
will involve the predicate logic operators
V (or), & (and), ~ (not) and => (implies) (b), in
addition to the temporal logic operators.

Given below are the axioms and inference rules
that form the temporal logic system (c). A and B
are arbitrary Temporal Logic formulas.

Axioms:

   []A => A & <>A & [][]A

   <><>A => <>A

   [](A => B) => ([]A => []B)

   A UNTIL B & ~B UNTIL C => A UNTIL C

   [](A => B) <=> [](A => (B UNTIL ~A))

Inference Rules:

   If A is a valid first-order logic formula
   then |- A.

   If |- A and |- (A => B) then |- B.

   If |- A then |- []A.


Certain temporal operators are derived from these
primitives, and are introduced to enhance the
readability of the specification language. They
are,

P ONLYIF Q     (P => Q) i.e., P is True only if Q
               is True.

P IFF Q        (P => Q) & (Q => P).

P ONLYAFTER Q  (~P UNTIL Q) i.e., P can become
               True only after Q does.

P AFTER Q      [(~P UNTIL Q) & <>P] i.e., P will
               become True after Q.

P CAUSES Q     (P => <>Q) &
               ~∃R {(R ≠ P) & (R ≠ Q) & (P => <>R)
               & (Q AFTER R)} i.e., P is the
               sole cause for Q to become True.

where P and Q are arbitrary conditions.

(b) The operator precedence is ~, {<>,[]},
{V,&}, UNTIL, followed by =>.

(c) The choice of axioms and inference rules
listed here is based upon their utility in
subsequent sections. No claim is made for their
completeness.

The following are true for a particular operation
a:

   req(a) => [req(a) UNTIL exec(a)]

   start(a) ONLYIF req(a)

   start(a) => [start(a) UNTIL exec(a)]

   start(a) CAUSES [exec(a) & ~start(a) & ~req(a)]

   start(a) => [∀b≠a ~start(b) UNTIL ~start(a)]

   ~exec(a) => [~exec(a) UNTIL start(a)]

   exec(a) => [exec(a) UNTIL term(a)]

   [term(a) & exec(a)] CAUSES [~term(a) & ~exec(a)]

These statements are the axioms formalizing the
synchronization model.

Using the primitive conditions, we define the
following:

req(a)[cond]   there exists a request for
               operation a satisfying 'cond',
               i.e., ∃a(req(a) & cond).

req$A          ∃a∈A req(a), i.e., there exists a
               request of class A.

exec$A         ∃a∈A exec(a), i.e., an operation of
               class A is active.


The Specification Statements 


The temporal operators defined earlier serve as
the building blocks for our specification
language. The semantics of the various
specification statements are given in terms of
these temporal operators.

While developing our specification language and
the synthesis procedure, it will be instructive
to consider a typical synchronization problem
encountered in the context of operating systems.
Although it is a simple example, it serves to
illustrate the important aspects of the approach.


The Limited Resources Problem [8]: A fixed
number of similar resources is managed by an
operating system. User processes acquire a
resource by executing the operation 'acquire',
and release the resource by the operation
'release'. The variable 'free' maintains the
number of available resources, while 'max' gives
the maximum number of resources in the pool.
'Release' is given priority over 'acquire'.

Operation variables  The specification language
possesses features that result in succinct
specifications. One of these is the facility to
refer to a class of operations using a generic
operation name. Specifications involving this
operation name apply to all operations in that
class.

Specification   OPERATIONS a:A;

Semantics       ∀a∈A (S);

where S is a specification statement involving
'a' and applies to each operation in class A.

Example         OPERATIONS r: release;
                           a: acquire;

The above specification declares operation
variables for the limited resources problem,
where 'r' refers to any 'release' operation, and
'a' refers to any 'acquire' operation.

Resource state Information  During the active
phase of an operation, the "state" of the shared
resource may be altered. For instance, 'acquire'
reduces the number of free resources. Scheduling
constraints often involve predicates on the
resource state. For instance, 'acquire' can be
serviced only if there are free resources. The
above discussion demonstrates the need for
expressing the synchronization constraints that
depend on resource state. This language
facilitates the specification of the following
aspects of the resource:

- The data structures that determine the
  resource state.

- Initial resource state.

- The modification to resource state by
  operations in each class, and

- Invariance of resource state.

Example

   RESOURCE STATE INFORMATION
      STATE VARIABLES ARE
         free : integer;
         max : constant integer <- 10;
      INITIALLY
         free <- max;
      STATE CHANGES
         acquire: free <- free-1;
         release: free <- free+1;
      STATE INVARIANCE
         0 ≤ free ≤ max;

These statements specify resource state
information needed for a synchronizer of 10
resources.
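
For readers who prefer an executable picture, the resource state information above can be mirrored as a small data model. This is a hedged sketch, not part of the specification language: ResourceState, STATE_CHANGES and apply_change are hypothetical names, and the invariant is taken to be the non-strict bound 0 ≤ free ≤ max.

```python
from dataclasses import dataclass

MAX = 10  # 'max : constant integer <- 10'


@dataclass(frozen=True)
class ResourceState:
    free: int = MAX  # INITIALLY free <- max

    def invariant(self) -> bool:
        """STATE INVARIANCE: 0 <= free <= max."""
        return 0 <= self.free <= MAX


# STATE CHANGES, one entry per operation class.
STATE_CHANGES = {
    "acquire": lambda s: ResourceState(s.free - 1),
    "release": lambda s: ResourceState(s.free + 1),
}


def apply_change(state: ResourceState, op_class: str) -> ResourceState:
    """Return the state after one operation of the given class takes effect."""
    return STATE_CHANGES[op_class](state)


initial = ResourceState()
after_acquire = apply_change(initial, "acquire")
print(after_acquire.free, after_acquire.invariant())

# A release from the initial state would break the invariant, which is why
# the synchronizer has to constrain when 'release' may be serviced.
print(apply_change(initial, "release").invariant())
```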


Scheduling Constraints  Scheduling constraint
specifications express the explicit conditions
under which an operation can be "serviced".

Specification

   cond1@req(op_name) =>
      []{Start(op_name) ONLYIF cond2};
   cond3@req(op_name) =>
      []{Start(op_name) ONLYAFTER cond4};

For an operation 'p', cond1@req(p) refers to the
value of cond1 when the request for p arrives. In
general, cond1 and cond3 are conditions dependent
on resource state, or arguments to the requested
operation, or both. In the case of synchronized
operations, 'req(op_name)' is a necessary
constituent of cond2. If the 'cond1@req(op_name)'
clause is not specified, then it is true by
convention.


Example 

SCHEDULING CONSTRAINT 
[]{Start(a) ONLYIF Req(a)}; 
[]{Start(r) ONLYIF Req(r)}; 


These specify the requirement that release and 
acquire operations should be serviced only if 
requests exist for them. 


Invariance  The invariance specifications express
the constraints with regard to the resource
state, in the following manner:

Specification : STATE INVARIANCE I
Semantics     : []I
Example       : STATE INVARIANCE 0 ≤ free ≤ max;


Exclusion of Operations  In our specifications,
concurrency is assumed to be the rule, and
exclusion the exception. So when two operations
are to exclude each other, there has to be a
specification so stating.

Exclusion among operations in different classes

Specification : A EXCLUDES B
Semantics     : []~(exec$A & exec$B)
                A,B ∈ {operation classes}

Exclusion among operations in a class

Specification : A's EXCLUDE
Semantics     : []~{exec(a1) & exec(a2)}
                ∀a1,a2 ∈ A, A an operation class.

Total exclusion of all operations

Specification : EXCLUSION all
Semantics     : I EXCLUDES J & I's EXCLUDE
                ∀I,J ∈ {operation class}.

Example

   Acquire EXCLUDES Release;
   Acquire's EXCLUDE;
   Release's EXCLUDE;

or equivalently,

   EXCLUSION all;


Priority among Operations  We classify priority
into the following two categories:

- Priority within requests of a particular
  operation class, otherwise known as
  intra class priority.

- Priority between different operation classes,
  otherwise known as inter class priority.

In general, both inter class and intra class
priorities can depend on resource state. This
dependence can be specified in this language
through the use of 'resource state predicates'.
A 'resource state predicate' is a predicate on
the state of the resource and is said to be True
if current resource state implies truth of the
predicate.

We will see how the priority statements are
specified, and give their temporal semantics.

(d) {operation classes} stands for the set of
operation classes.


Specification: INTRA CLASS PRIORITY
   operation class :-
   resource state predicate : priority rule

Informal semantics:

If "C :- r : expr" is an intra class priority
specification, then 'expr' gives the priority
rule applicable to operations in class C when
the resource state satisfies 'r'.

Formal semantics:

   I = {intra class priority specification}
   OP ∈ {operation class},
   r ∈ {resource state predicate}
   pr_rule is an arithmetic expression that
   evaluates to an integer.

   (OP :- r : pr_rule) ∈ I, ∀op1,op2 ∈ OP,
   []{[r & req(op1) & req(op2) &
       (expr|op1 < expr|op2)] =>
      [Start(op1) ONLYAFTER Start(op2)]}

where expr|a stands for the value of expr
evaluated in the context of req(a). This
specification expresses the requirement that in
a given class, operations with lower priority
should start only after all other relevant
requests with higher priority have started.

In the absence of an intra class priority
statement, order of arrival of requests
determines the priority of operations in each
class. This corresponds to an FCFS discipline.


Specification: INTER CLASS PRIORITY
   resource state predicate :
   operation class b > operation class a

Informal semantics:

Given an inter class priority statement
"r : B > A", if current resource state satisfies
r, then operations in class B have higher
priority than those in class A.

Formal semantics:

   I = {inter class priority specification}
   R = {res state predicate}
   OP = {operation class}

   ∀opc1,opc2 ∈ OP, ∀r ∈ R,
   (r : opc2 > opc1) ∈ I, ∀op1 ∈ opc1, ∀op2 ∈ opc2,
   []{[r & req(op1) & req(op2)]
      => [Start(op1) ONLYAFTER Start(op2)]}


As noted earlier, an operation (say p) is
enabled if its becoming active will not infringe
specified scheduling constraints, mutual
exclusion and invariance. This is written as
'enabled(p)', and its negation 'disabled(p)'. In
the semantics above, priority was specified among
requested operations. However, there are cases
when only those operations which are enabled are
to be considered for priority. We refer to this
as 'priority among enabled operations'. In such
cases, an operation can start only after enabled
operations of higher priority have been serviced.
Formal semantics in this case is obtained by
substituting 'enabled(op)' for 'req(op)' in the
above specifications.

Example  In the limited resources problem,
'release' operations are given higher priority
than 'acquire' operations. Because the number of
resources is limited, priority based on requests
may result in a deadlock. Hence we have

   INTER CLASS PRIORITY AMONG ENABLED OPERATIONS
      release > acquire ;


Scheduling Discipline  In case more than one
operation is enabled, the synchronizer resolves
the conflict using the priority or scheduling
discipline specifications, and eventually
services one of the enabled operations.

Scheduling discipline statements specify "fair"
behavior of the synchronizer, or in practice,
what a user construes fairness to mean [16].
They effectively express behavior of the conflict
resolution strategy. We say that a scheduler is
fair if it conforms to the specified scheduling
discipline. Possible versions of scheduling
discipline are:


Scheduling Discipline 0

An operation that is enabled is serviced, i.e.,

   Enabled(op) => <>Start(op)                (SD0)

If 'op' is such that it can be disabled before
the synchronizer recognizes that it is enabled,
then SD0 will not be appropriate.

Scheduling Discipline 1

If we want to express the fact that enabling of
an operation causes its start, then we have SD1,
defined as follows.

   Enabled(op) CAUSES Start(op)              (SD1)

This expresses the direct causality between
enabling of an operation and its starting.

Scheduling Discipline 2

If an operation is going to remain enabled till
it is serviced, then it will be serviced, i.e.,

   [enabled(op) & (enabled(op) UNTIL Start(op))]
      => <>Start(op)                         (SD2)

Scheduling Discipline 3

If an operation would otherwise be enabled
infinitely often, then the operation is serviced.

   [enabled(op) & ({[]<>enabled(op)} UNTIL
    Start(op))]
      => <>Start(op)                         (SD3)

This will be suitable for operations that are not
continuously enabled but are repeatedly enabled.

Scheduling Discipline 4

This type of fairness is based on the order of
arrival of requests. The earliest to arrive will
always be chosen for service. Formally, the
expression

   {[Req(op2) AFTER Req(op1)]
      => [Start(op2) ONLYAFTER Start(op1)]}  (SD4)

states that, given that the request for op2
arrived after that of op1, op2 can be serviced
only after op1.

Whenever priority specifications are applied, the
following variation of SD2 is required.

   {[enabled(op) & P] &
    [(enabled(op) & P) UNTIL Start(op)]}
      => <>Start(op)                         (SD2P)

Here P is a condition which holds iff the
priority specification is satisfied. This is
necessitated by the sequential model assumed for
the synchronizer and the fact that requests
originate in external processes.

Overall Specification of the Limited Resources
Problem  Given below is the overall specification
for the Limited Resources problem. Note that the
specification for the problem is obtained by
conjoining individual specifications.


SYNCHRONIZER Limited_Resources IS
   OPERATION CLASSES acquire, release;
   OPERATIONS a:acquire; r:release;
   SCHEDULING CONSTRAINT
      Start(a) ONLYIF Req(a);
      Start(r) ONLYIF Req(r);
   RESOURCE STATE INFORMATION
      STATE VARIABLES ARE
         free : integer;
         max : constant integer <- 10;
      INITIALLY
         free <- max;
      STATE CHANGE
         acquire: free <- free-1;
         release: free <- free+1;
      STATE INVARIANCE
         0 ≤ free ≤ max;
   EXCLUSION all;
   INTER CLASS PRIORITY AMONG ENABLED OPERATIONS
      release > acquire ;
   SCHEDULING DISCIPLINE SD2P ;
END Limited_Resources;


This example illustrates the salient features of
the language. The fact that each distinct
property of the limited resources problem was
specified independent of the rest attests to the
modularity and extensibility of specifications
in the language. Using the top-level constructs,
we have been able to specify standard
synchronization problems including different
versions of readers-writers problems [4], and
disk-scheduler problems incorporating priority
[9].


THE SYNTHESIS ALGORITHM 


Given the top-level specification of required
synchronization, we propose an algorithm which
derives, in stages, synchronization code (in a
prespecified target language) which will achieve
the required synchronization. Synthesis is
achieved by a series of transformations from the
top-level specifications until a stage is reached
when statements can be directly translated into
primitives in the target language. The
transformation is carried out in a
target-independent fashion until that stage. We
pursue the example of limited resources
synchronization to exemplify the synthesis steps.
To keep this presentation manageable, only
transformations required for constructs in the
example will be discussed here.


Effecting Resource state changes 


In this step, all changes to resource state by
the synchronized operations are "mirrored" within
the synchronizer in the following manner: For
each resource state variable, a "synchronizer
variable" local to the synchronizer is created
with the same type and initial value. The
synchronizer mirrors a resource state change
(effected by a serviced operation) by addition of
statements of the form

   start(op) CAUSES caused_action;          (Rule1)

where caused_action modifies "synchronizer
variables". These modifications correspond to
resource state changes specified for operation
'op' when executed in exclusion. The
semantics of caused_action is obtained from the
'Resource state change' statements. All
specification statements that involve resource
state variables are respecified in terms of the
synchronizer variables.

Example: Retaining the names of the resource
state variables as in the specifications but
making them local to the synchronizer, the
resource state changes will be mirrored by the
following statements derived using Rule1.

   Start(a) CAUSES (free <- free-1);
   Start(r) CAUSES (free <- free+1);

Every future resource state modification by a
serviced operation is faithfully reflected by the
synchronizer variables. Hence this step is a
meaning-preserving transformation.


The next step in the transformation process is to 
derive necessary conditions for servicing a 
request, i.e. starting an operation. These are 
embedded in the Scheduling Constraint, Mutual 
Exclusion, and Resource state Invariance 
Specifications. Deriving the necessary conditions 
from these statements is the subject of the 
following discussion. 


The specification 'A EXCLUDES B' is transformed
into

   Start(a) => ~Exec$B
   Start(b) => ~Exec$A                      (Rule2a)

where a is an operation in class A and b in B.
The case of exclusion of different instances of
the same operation class A translates to the
intermediate specification

   Start(a) => ~Exec$A                      (Rule2b)

Since an operation in a class is serviced only if
there are no active operations belonging to
classes which exclude it, mutual exclusion is
guaranteed.

(e) Since operations that change resource state
need to execute in exclusion, this transformation
is appropriate.


Achieving Resource state invariance

Resource state changes as mirrored within the
synchronizer are 'caused' by the Start of an
operation. Invariance will be maintained by
ensuring that the invariant will not be falsified
by the action. This can be done by deriving a
precondition (e.g., by the backward substitution
technique of program verification [14]) for the
synchronizer action from the invariance
specification and the semantics of changes to
the resource state by the operation. An
operation is enabled only when the precondition
is True. For example, if we had the following
specifications,

   Start(op) ONLYIF cond(op)
   Start(op) CAUSES caused_action(op), and
   Invariance INV

then the transformed specification will be

   Start(op) ONLYIF cond(op) & precond      (Rule3)

where 'precond' has to be true when start occurs
in order for the invariant to be true after the
"caused action". Thus Rule3 preserves specified
invariance of resource state.

Example: The precondition for release is derived
to be "free<max" and for acquire it is "free>0".
Using Rule3 we arrive at the following
statements.

   Start(r) => free<max & req(r);
   Start(a) => free>0 & req(a);

When invariance, scheduling constraint and mutual
exclusion statements have been transformed, we
can have for each synchronized operation 'op',
statements of the form

   <cond1>@req(op) =>
      []{Start(op) ONLYIF necessary_condition(op)};

   <cond3>@req(op) =>
      []{Start(op) ONLYAFTER cond4};

   Start(op) CAUSES caused_action(op);

where necessary_condition(op) is the conjunction
of all necessary conditions for 'op' to be
enabled. Scheduling discipline and priority
statements are inherited from the top-level
specifications.

Example: In the limited resources problem,
conjoining all necessary conditions for acquire
and release respectively, we derive the following

   Start(a) => {~Exec$acquire & ~Exec$Release &
                free>0 & req(a)};
   Start(r) => {~Exec$acquire & ~Exec$Release &
                free<max & req(r)};
   Start(a) CAUSES (free <- free-1);
   Start(r) CAUSES (free <- free+1);

Priority and scheduling discipline specifications
are yet to be transformed.
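
The derived statements translate naturally into a guard and an action per operation class. The following is a minimal sketch under stated assumptions, not the target-language code the synthesis produces: SyncState, nec_cond and start are hypothetical names, requests are represented as (id, class) pairs, and termination handling is left out.

```python
from dataclasses import dataclass, field

MAX = 10


@dataclass
class SyncState:
    free: int = MAX                       # synchronizer variable mirroring 'free'
    active: dict = field(
        default_factory=lambda: {"acquire": 0, "release": 0}
    )                                     # number of executing operations per class
    requests: set = field(default_factory=set)  # pending (op_id, op_class) pairs


def nec_cond(state: SyncState, op_id: str, op_class: str) -> bool:
    """Conjunction of the necessary conditions for Start(op)."""
    nothing_active = state.active["acquire"] == 0 and state.active["release"] == 0
    requested = (op_id, op_class) in state.requests
    if op_class == "acquire":
        return nothing_active and state.free > 0 and requested
    return nothing_active and state.free < MAX and requested


def start(state: SyncState, op_id: str, op_class: str) -> None:
    """Service a request: Start(op) CAUSES the mirrored state change."""
    if not nec_cond(state, op_id, op_class):
        raise ValueError(f"{op_class} {op_id} is not enabled")
    state.requests.discard((op_id, op_class))
    state.active[op_class] += 1
    state.free += -1 if op_class == "acquire" else 1


s = SyncState()
s.requests.add(("a1", "acquire"))
s.requests.add(("r1", "release"))
print(nec_cond(s, "a1", "acquire"))   # acquire is enabled: free > 0, nothing active
print(nec_cond(s, "r1", "release"))   # release is disabled: free == max
start(s, "a1", "acquire")
print(s.free)
```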


Transformation of Priority Specifications

Transformation of priority specifications brings
in the issue of manifestation of requests within
the synchronizer. This requires some insight
into the notion of implementation, which is
presently introduced.


We assume that the target language possesses an
abstract data type called 'queue' with primitives
to enqueue elements onto and dequeue elements
from them. A queue 'element' is designated by a
queue name (say Q) qualified by an 'element
index' (say i), as in Q[i].

The attributes of an operation are:

op_name    Name of the operation.

op_class   Operation class.

nec_cond   Conditions necessary for the
           operation to be enabled.

intracp    Priority of the operation within
           its operation class.

intercp    Priority of the operation with
           respect to operations in other
           classes.

op'attr denotes the attribute 'attr' of operation
'op'. From the informal definition of
'enabled(op)' given earlier, enabled(op) iff
op'nec_cond.

Attributes of a queue are:

op_class   The class(es) of operations that
           can enqueue onto that queue.

pr_rule    The priority rule that applies to
           all operations in the queue, if one
           such rule exists.

pr_class   Intercp value of the operations in
           the queue if all have the same
           inter class priority.

len        Number of elements in the queue.

Q'attr refers to attribute 'attr' of the queue
named 'Q'.

A queue element corresponds to a request for an
operation. Thus Q[i] and Q[i]'nec_cond refer,
respectively, to the operation and necessary
condition corresponding to the i-th element in
Q. Given a queue Q, i∈Q stands for i∈{1..Q'len}
and op∈Q iff ∃i∈Q(Q[i]=op).


General Statement of Priority 
Before we translate priority specifications, it
will be instructive to examine what is meant by
priority in general, and how the specifications
determine priority among operations. When we say
that an operation (say b) has higher priority
than another (say a), we mean that a can be
serviced only after b is serviced, i.e.,

[]{req(a) & req(b) & (a'pr < b'pr)
   => start(a) ONLYAFTER start(b)}


where op'pr is a pseudo attribute of 'op'
computed using op'intercp and op'intracp (as
shown below). This expression of priority can be
shown to be equivalent to

[]{start(b) ONLYIF ~req(a)[a'pr > b'pr]}


The general semantics of priority is then,

[]{start(a) ONLYIF
   ~req(b)[b'pr > a'pr]}                   (P1)

[]{start(a) ONLYIF
   ~req(b)[enabled(b) & (b'pr > a'pr)]}    (P2)

where P1 is applicable when priority is specified
among requested operations and P2 among enabled
operations.
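The difference between P1 and P2 can be illustrated with a small Python sketch. It is not part of the original development: the `Request` record, the integer `pr` field (a totally ordered stand-in for the partially ordered op'pr) and the function `may_start` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    pr: int         # hypothetical stand-in for the pseudo attribute op'pr
    enabled: bool   # hypothetical stand-in for enabled(op), i.e. op'nec_cond


def may_start(a, pending, among_enabled):
    """Check the ONLYIF condition of P1 (among_enabled=False) or P2 (True).

    `a` may start only if no other pending request b has b.pr > a.pr;
    under P2 only *enabled* higher-priority requests count.
    """
    for b in pending:
        if b is a:
            continue
        if b.pr > a.pr and (b.enabled or not among_enabled):
            return False
    return True


low = Request("a", pr=1, enabled=True)
high_disabled = Request("b", pr=2, enabled=False)
pending = [low, high_disabled]
```

Under P1 the pending request `b` blocks `a` even though `b` is disabled; under P2 it does not, because only enabled higher-priority requests are considered.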


Now we will discuss how op'pr is determined for
any operation op. The inter class priority
specification "r : opc1 > opc2" has the following
semantics:

∀op1∈opc1, ∀op2∈opc2,
  {[r & req(op1) & req(op2)] =>
   op1'intercp > op2'intercp}          (DEFN1)

The intra class priority specification
"opc1 :- r : pr_rule" has the following
semantics:

∀op1,op2∈opc1,
  {[r & req(op1) & req(op2) &
    (pr_rule'op1 > pr_rule'op2)] =>
   [op1'intracp > op2'intracp]}        (DEFN2)


These follow directly from the definitions of
intracp and intercp. Also,

If (a'intercp > b'intercp) then (a'pr > b'pr).
If a'op_class = b'op_class and
(a'intracp > b'intracp) then (a'pr > b'pr).
                                       (DEFN3)


As was noted earlier, since an operation's
intercp and intracp can vary with resource state,
for any operation 'op', op'pr is also dependent
on resource state. Notice that '>' defines a
partial ordering among operations.
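A minimal Python sketch of DEFN3 shows why '>' is only a partial ordering. The `Op` record, the integer-valued `intercp` and `intracp` fields, and the function `compare_pr` are hypothetical; the sketch also assumes that an inter class difference takes precedence when both rules could apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    op_class: str
    intercp: int   # assumed integer encoding of inter class priority
    intracp: int   # assumed integer encoding of intra class priority


def compare_pr(a, b):
    """Return 1 if a'pr > b'pr, -1 if b'pr > a'pr, None if DEFN3 relates neither."""
    if a.intercp != b.intercp:
        return 1 if a.intercp > b.intercp else -1
    if a.op_class == b.op_class and a.intracp != b.intracp:
        return 1 if a.intracp > b.intracp else -1
    return None  # incomparable: the ordering is partial


release = Op("r", "release", intercp=2, intracp=0)
acquire_old = Op("a1", "acquire", intercp=1, intracp=5)
acquire_new = Op("a2", "acquire", intercp=1, intracp=3)
other = Op("x", "other", intercp=1, intracp=9)
```

Operations in different classes with equal intercp (here `acquire_old` and `other`) are left unordered, which is the sense in which the ordering is partial.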


Now we proceed with the transformation of 
priority specifications. Transformations will be 
consistent with the specifications if: 


1. There exists a one-to-one mapping from
requested operations to elements in queues.
This is ensured by enqueuing each request
onto its "waiting-queue".

2. From each queue, always only a certain
"preferred" operation is serviced.

3. An operation is serviced only if it is 
enabled. 

4. When an operation is serviced, appropriate 
(specified) actions are caused. 

Systematic transformation rules exist for
priority specifications. These are based on

- The type of priority specified, viz. among
  enabled or requested operations, or inter
  class or intra class priority,

- The dependency of pr_rule on resource state,

- The behavior of necessary conditions of
  operations, etc.

Space limitations preclude discussion of the
details of these transformation rules. Instead,
the translation required by the limited resource
problem will be explained in detail. The
following is true for this problem.

1. Priority applies among enabled operations
   only.
2. All acquire (release) operations have the
   same necessary conditions.
3. Order of arrival determines the priority
   within each class.


1) Since intra class priority is not specified,
and all operations in a class have the same
necessary condition, a queue is designated for
each operation class and for each queue 'Q',

[]{start(Q[i]) ONLYIF i=1}.


2) Since priority is specified among enabled
operations, inter class priority manifests itself
as follows:

[]{start(Q[i]) ONLYIF
   [∀Q1 (Q1'pr_class > Q'pr_class) disabled(Q1)]}.

Here disabled(Q) stands for ∀op∈Q{disabled(op)}
and enabled(Q) for ~disabled(Q).


For the limited resources problem, we designate
'aq' and 'rq' to serve as the queues for acquire
and release respectively. The following four
pairs of statements result.

req(a) => a∈aq;
req(r) => r∈rq;

[]{start(aq[i]) ONLYIF
   i=1 & enabled(aq[i]) & disabled(rq)};
[]{start(rq[i]) ONLYIF
   i=1 & enabled(rq[i])};

start(aq[i]) CAUSES (free<-free-1);
start(rq[i]) CAUSES (free<-free+1);


enabled(aq[i]) means [0<free & req(a) &
                      ~exec$release & ~exec$acquire]

enabled(rq[i]) means [free<max & req(r) &
                      ~exec$release & ~exec$acquire]


The above transformations ensure that

1. A request is enqueued onto a designated
   queue depending on its class and necessary
   conditions.

2. An element is serviced only if it is enabled
   and there are no requests with higher
   priority.

3. Servicing a request causes the required
   action.

Hence the transformation above preserves the
semantics of the top-level specifications.


From an examination of the possible primitive
conditions, we observe that since at this stage
of the transformation, requests are manifest as
elements in queues, conditions of the form
req(a)[cond] will have to be expressed in terms
of conditions on elements in the queues. This is
achieved by knowing the relationship between
queues, operations, and their necessary
conditions.

Deriving Target Language Code 

In an abstract sense, the synchronizer code
consists of the following types of statements:

- Enqueuing statements: These indicate how the
  synchronizer responds to the arrival of
  requests. After determining the class of the
  request and the conditions that hold at the
  time of arrival, the synchronizer determines
  the queue onto which a request has to be
  enqueued. Equivalently, the synchronized
  processes may enqueue onto the appropriate
  queue.

- Servicing statements: These involve the
  conditions necessary for a request in a queue
  to be serviced. After determining whether
  these conditions hold, the synchronizer takes
  actions tantamount to servicing a request.

- Causal statements: These indicate the changes
  to resource state, etc., that have to be
  caused after an operation is serviced. After
  servicing a request, the synchronizer effects
  these changes.

These are the only categories of executable
statements that may be found in a synchronizer
other than initialization code.


The scheduling discipline specification expresses
the behavior that enabled operations should
possess in order to be serviced. Thus its
effect will have to be displayed by the choice
made by the synchronizer in servicing enabled
operations. It is, in turn, manifest in the
servicing statements. Space constraints prevent
detailed analysis of appropriateness of
scheduling disciplines here.

We proceed to see how the scheduling discipline
specification manifests in the synchronization
code constructed. We assume that the scheduling
discipline is independent of resource state.


If Scheduling Discipline specified is SD0-SD2,
then each operation class has a separate queue.
For each queue Q,
[]{∃i enabled(Q[i]) & ∀j 1≤j<i disabled(Q[j])
   => start(Q[i])}.

If SD3 is specified, all requests are enqueued
onto a single queue, and the first enabled
operation on the queue is serviced before the
rest, i.e.,
[]{∃i enabled(Q[i]) & ∀j 1≤j<i disabled(Q[j])
   => start(Q[i])}.

This transformation is valid if an enabled
operation can be disabled only by the
synchronizer.


For SD4, a FCFS scheduling discipline is required 
and hence all requests are enqueued onto a single 
queue. The first element in the queue is always 
serviced once it is enabled, i.e., 
[]{enabled(Q[1]) => start(Q[1])}. 


If all operations in a class have the same
nec_cond then all operations in a queue are
enabled or all are disabled. This can be used to
advantage as follows: If Scheduling Discipline
is specified as SD0-SD2 then each operation class
has a unique queue. For each queue Q,
[]{enabled(Q[1]) => start(Q[1])}.
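The servicing rules above differ only in which queue element is allowed to start. A hedged Python sketch follows; `next_to_service`, its `fcfs` flag and the 0-based indexing are illustrative assumptions and do not appear in the paper.

```python
def next_to_service(queue, is_enabled, fcfs=False):
    """Return the 0-based index of the element to start, or None.

    fcfs=False: first enabled element of the queue (the SD0-SD3 rule above).
    fcfs=True : only the head may start, and only once enabled (the SD4 rule).
    """
    if fcfs:
        return 0 if queue and is_enabled(queue[0]) else None
    for i, op in enumerate(queue):
        if is_enabled(op):
            return i
    return None


queue = ["x", "y", "z"]
enabled = lambda op: op != "x"   # hypothetical: only the head is disabled
```

With the head disabled, the first-enabled rule selects index 1 while the FCFS rule selects nothing. When every element of a queue shares one nec_cond, the two rules coincide, which is the simplification noted above.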


In situations where priority specifications
apply, scheduling discipline SD2P will be in
effect, which will be satisfied if the servicing
is done as in

[]{enabled(op) & cond_on_op => start(op)},

where cond_on_op is the condition derived from
priority specifications.


Realization of a synchronizer for the
Limited Resources Problem

The preconditions for servicing operations are
first simplified thus:

[non-empty(aq) & free>0 & ~exec$acquire &
 ~exec$release & disabled(rq)] =>
[non-empty(aq) & free>0 & ~exec$acquire &
 ~exec$release & (empty(rq) V free=max)].
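The simplification can be checked by brute force over small states. The Python sketch below assumes that a release request is enabled exactly when free<max and no operation is executing, that all release requests share this condition, and that 0 ≤ free ≤ max always holds; the function names and the bounds are hypothetical.

```python
from itertools import product


def disabled_rq(rq_len, free, max_units, exec_acquire, exec_release):
    """disabled(rq): every element of rq is disabled (vacuously true if empty)."""
    release_enabled = free < max_units and not exec_acquire and not exec_release
    return rq_len == 0 or not release_enabled


def simplification_holds(max_units=3, max_queue=3):
    """Compare disabled(rq) with (empty(rq) V free=max) when nothing is executing."""
    for rq_len, free in product(range(max_queue + 1), range(max_units + 1)):
        lhs = disabled_rq(rq_len, free, max_units, False, False)
        rhs = (rq_len == 0) or (free == max_units)
        if lhs != rhs:
            return False
    return True
```

Under these assumptions the two conditions agree on every enumerated state, so disabled(rq) may be replaced by (empty(rq) V free=max) in the precondition for servicing an acquire.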


Mapping the result of the transformations derived 
so far into the primitives of Sentinels, a 
construct introduced in [11], we arrive at the 
following Sentinel implementation of the 
synchronizer. 


Procedure Limited_Resources (rq,aq : queues);
   max : constant integer := 10;
   a_count,r_count,free : integer;
   a_count:=0; r_count:=0; free:= max;
   Do while (True)
      If {non_empty(rq) & free<max &
          a_count=0 & r_count=0}
      then begin
         detach execute rq[1] count(r_count);
         free := free+1
      end;
      If {non_empty(aq) & 0<free & a_count=0 &
          r_count=0 & [free=max V empty(rq)]}
      then begin
         detach execute aq[1] count(a_count);
         free := free-1
      end
   end;


This sentinel program is arrived at after the
following translation from primitive conditions
in the specification language to primitives in
sentinels.

Start(q[i])  -->  detach execute q[i]

Exec(op)     -->  op_count>0

Req(op)      -->  non_empty(op_q)

Each request has a distinct request queue in the
sentinel. Concurrent processes will enqueue
acquire requests onto aq, and release requests
onto rq as per the transformation rule for
priority.
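For readers unfamiliar with Sentinels, the control structure of the loop can be mimicked in Python. This is a sequential sketch under stated assumptions, not the Sentinel semantics: operations are treated as instantaneous (so the execution counts stay at zero), and the class and method names are hypothetical.

```python
from collections import deque


class LimitedResources:
    """Sequential sketch of the synchronizer loop (names are illustrative)."""

    def __init__(self, max_units=10):
        self.max = max_units
        self.free = max_units
        self.aq = deque()    # pending acquire requests
        self.rq = deque()    # pending release requests
        self.a_count = 0     # acquires in execution (stays 0: instantaneous ops)
        self.r_count = 0     # releases in execution (stays 0: instantaneous ops)
        self.serviced = []   # order in which requests were started

    def step(self):
        """One pass through the loop body; True if some request was serviced."""
        progressed = False
        if (self.rq and self.free < self.max
                and self.a_count == 0 and self.r_count == 0):
            self.serviced.append(self.rq.popleft())
            self.free += 1
            progressed = True
        if (self.aq and 0 < self.free
                and self.a_count == 0 and self.r_count == 0
                and (self.free == self.max or not self.rq)):
            self.serviced.append(self.aq.popleft())
            self.free -= 1
            progressed = True
        return progressed

    def run(self):
        """Repeat the loop body until no request can be serviced."""
        while self.step():
            pass
```

For example, with two units, three queued acquires leave the third waiting until a release arrives; the release is then serviced first and the waiting acquire follows in the same pass.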

Obviously, this sentinel will be grossly
inefficient since it is 'busy waiting' for one of
the necessary conditions to hold. Applying
techniques similar to Schmid [18] and Ford [5],
more efficient code can be generated. For
instance, when a primitive condition becomes
true, we need to evaluate only those necessary
conditions which are "influenced" by it. Means to
arrive at optimum code is one of the directions
on which we are currently working. This example
was presented only to illustrate the
structure of the resulting solution. By providing
rules to translate primitive conditions in the
specification language to primitives in
Serializers [1] and Monitors [9], we should be
able to synthesize code for Serializers and
Monitors.


RELATED WORK 


There have been a variety of specification
languages based on regular expressions [19]. Of
them, Path Expressions [8] are perhaps the most
widely referenced. Numerous versions of Path
Expressions have since been published. Since
specifications are in the form of permitted
operation sequences, rather than exclusions,
invariants, etc., contrived path expressions may
result. Further, the notion of eventuality is not
expressible in path expressions.


Synthesis of synchronizers from specifications in
Greif's language [6] has been described by
Laventhal [13]. In that approach, properties
such as exclusion, priority, etc. are engineered
by suitable ordering specification for some 'key'
events pertaining to an operation. The language
seems to lack the expressiveness to specify
eventuality and synchronization properties
dependent on resource state. One of the drawbacks
of the synthesis algorithm is the necessity to
consider all possible orderings of event
expressions contained in the specifications in
order to determine the set of allowable
orderings.


Another work with a goal similar to ours is that
of Griffiths [7]. The problem description
supplied to her synthesizer consists of a
low-level specification of the problem. Calls to
synchronizing functions surround code that accesses
shared resources. Code for the synchronizing
functions is generated using the assertions that
precede and immediately follow the calls. Our
problem description is at a higher level in that
it specifies the problem and not a solution to
the problem.

A language and implementation for mutual
exclusion only has been proposed by Brinch Hansen
and Staunstrup [3].


CONCLUSION 
This presentation reflects some of our current
thinking with respect to specification languages
for concurrent systems and certain aspects of
synthesis of synchronization code for concurrent
programs. The previous sections elaborated our
present ideas with respect to the two broad
goals: mechanisms for specification, and
synthesis of synchronizers.

The specification language has constructs for
stating the set of properties that are normally
relevant to concurrent operations, namely
ordering, fairness, priority, exclusion (and by
default, concurrency), and invariance of resource
state. Temporal logic provides the framework to
express their semantics precisely. Some of the
positive features of this language are evidenced
by the example used in the paper.
An important problem in specifying program
behavior is whether or not one can verify that
the specification itself is correct. This problem
is aggravated by the conceptual gap that normally
exists between the informal notion of what the
program is expected to solve and the formal
specification technique. Hopefully, the
specification language we have proposed here will
help bridge this gap. We believe that the
approach taken here meets Bloom's criteria [2]
for a synchronization mechanism to be suitable
for the construction of well-structured software.
Development of a specification language that aids
programmers and at the same time is amenable to
automatic synthesis of programs has been our
prime concern here, not a formally complete
language.
With respect to the goal of automatically
constructing code for synchronization, we
presented an algorithm which derives
synchronization code through successive
transformation of specification statements. At
each stage, we gave qualitative reasoning for
correctness of the translation process. We
consider noteworthy its ability to synthesize
synchronizers with prespecified fairness. The
issue of formal validation of the synthesis
algorithm has been explored, but is beyond the
scope of this paper.

An important issue is the practicality of the
synthesis algorithm. One virtue of the present
technique is the direct correspondence between
specifications and their implementation. However,
this can also lead to construction of inefficient
programs. We have not approached this problem in
any systematic way, since derivation of correct
programs has been our main concern so far. Also,
not all steps in the synthesis fall under the
category of pattern directed translation. Rule 3,
for instance, and simplification of necessary
conditions for servicing operations, require a
logical simplifier (albeit with limited
capabilities) built into the system. The
development herein provides the setting for a
more general automatic synthesis procedure, which
demands additional work before it becomes a
viable tool.
DATA BROADCASTING IN SIMD COMPUTERS


David Nassimi and Sartaj Sahni

University of Minnesota


Summary


An SIMD (Single Instruction stream, Multiple
Data stream) computer consists of some number, N,
of processing elements (PEs). Each PE(i), 0≤i≤N-1,
has a local memory. The PEs communicate through an
interconnection network. Three models of SIMD
computers are considered; these models differ only in
the way the PEs are interconnected [8]: 1) Mesh
Connected Computer (MCC) with N=n^q PEs forming a
q-dimensional nxnx...xn mesh. Each PE is connected
to (at most) 2q nearest neighbors. 2) Cube
Connected Computer (CCC) where each PE is connected to
logN other PEs. 3) Perfect Shuffle Computer (PSC)
with each PE connected to (at most) three other PEs.

Let D(i) be a data item contained in PE(i),
0≤i≤N-1. The data broadcasting problem for SIMD
computers can be posed in two different ways:

(i) Random Access Read (RAR)

In this formulation, an index S(i) is
contained in PE(i), 0≤i≤N-1. PE(i) is to receive data
from PE(S(i)). If PE(i) is not to receive data
from any other PE, then S(i)=∞.

(ii) Random Access Write (RAW)

Here, an index W(i) is contained in PE(i).
Data from PE(i) is to be transmitted to PE(W(i)),
0≤i<N. If W(i)=∞ then data from PE(i) is not
transmitted to any PE.

Some applications of RARs and RAWs may be
found in [4] and [5].

The RAR form of the data broadcasting problem
has been studied by Thompson [6]. He shows that
any RAR can be performed by making use of the
switch settings of a generalized-connection-network
(GCN) realizing the input-output mapping that
corresponds to the RAR. On an nxn MCC, his algorithm
requires no more than 13n-16 unit-routes (a
unit-route is a data transfer between PEs that are
adjacent in the interconnection network) for any RAR.
On an N-PE CCC and PSC, his algorithm requires
respectively 4logN-3 and 8logN-7 unit-routes.

None of these complexity figures includes the time
needed to determine the GCN switch settings. If
this time is included, the complexity of Thompson's
algorithm is determined by the complexity of the
GCN set-up algorithm. The best known parallel GCN
set-up algorithms (Nassimi and Sahni [5]) have
complexity of O(n) on an nxn MCC, and O(logN) on both
CCCs and PSCs with N PEs.

In this paper, we present an algorithm for the
RAR problem which runs in O(q^2 n) time on a
q-dimensional nxnx...xn MCC, and in O(log^2 N) time on an
N-PE PSC or CCC. Thus, the algorithm of this paper
is asymptotically faster than the Thompson-Nassimi-
Sahni algorithm for CCCs and PSCs. For MCCs, we
expect our algorithm to be significantly faster than
the Thompson-Nassimi-Sahni algorithm as the
algorithm of this paper is significantly simpler and
has much less overhead.

The RAW problem can be solved using the
subalgorithms developed for the RAR problem. Let d
be the maximum number of data items to be written
into any one PE. The time complexity of the RAW
is O(q^2 n + dqn) on a q-dimensional MCC, and
O(log^2 N + d log N) on an N-PE CCC or PSC.

RARs and RAWs are performed using certain
well defined steps. These are described below:

(i) SORT: In a sort, records are rearranged
so as to be in non-decreasing order of a specified
key. Let G(i) denote the record in PE(i), 0≤i<N.
Let H(i) be the key field of record G(i). H(i) is
also in PE(i). Following a sort, the records will
have been rearranged such that H(i)≤H(i+1),
0≤i<N-1.

(ii) RANK: The rank of a selected record is
the number of selected records in PEs with a
smaller index. For example, assume we have 8 PEs
each containing one record. Let the key values
for these 8 records be (6, 4, 2, 2*, 6, 6*, 3*, 4*)
where an asterisk on a key value denotes a flagged
or selected record. The ranks of the flagged
records are (-, -, -, 0, -, 1, 2, 3).

(iii) CONCENTRATE: Let G(i_r), 0≤r≤j, j≤N-1,
be a set of records with G(i_r) initially in PE(i_r).
Assume that the records have been ranked so that
H(i_r)=r. A concentrate results in record G(i_r)
being moved to PE(r), 0≤r≤j.

(iv) DISTRIBUTE: Let G(i), 0≤i≤j<N, be a
set of records with G(i) initially in PE(i). Let
H(i), 0≤i≤j, be a set of destinations such that
H(i)<H(i+1), 0≤i<j. The purpose of a distribute
is to route G(i) to PE(H(i)), 0≤i≤j. It is
easy to see that a distribute is the inverse of
a concentrate.

(v) GENERALIZE: A generalize makes multiple
copies of records. The initial configuration is
record G(i) in PE(i), 0≤i≤j<N. Each record
has a field H (high). The H values are such that
0≤H(0)<H(1)<...<H(j)≤N-1, and H(i)=∞ for
j<i<N. Generalize copies record G(i) into PEs
H(i-1)+1 through H(i), 0≤i≤j (we assume, for
convenience, H(-1)=-1, so that copying begins at PE(0)).
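The steps defined above can be sketched sequentially in Python, with a list index standing for a PE number. The code only mirrors the input-output behavior of each step and says nothing about the parallel routing or its cost; the function names, the use of `None` for an empty PE, and the convention H(-1) = -1 are assumptions of this sketch.

```python
INF = float("inf")


def rank(flags):
    """RANK: each selected position gets the count of selected positions before it."""
    ranks, count = [], 0
    for selected in flags:
        if selected:
            ranks.append(count)
            count += 1
        else:
            ranks.append(None)
    return ranks


def concentrate(records, ranks, n):
    """CONCENTRATE: move each ranked record to the PE numbered by its rank."""
    out = [None] * n
    for rec, r in zip(records, ranks):
        if r is not None:
            out[r] = rec
    return out


def distribute(records, dest, n):
    """DISTRIBUTE: route records[i] to PE dest[i] (dest strictly increasing)."""
    out = [None] * n
    for rec, h in zip(records, dest):
        if rec is not None:
            out[h] = rec
    return out


def generalize(records, high, n):
    """GENERALIZE: copy records[i] into PEs high[i-1]+1 .. high[i], with high[-1] = -1."""
    out = [None] * n
    prev = -1
    for rec, h in zip(records, high):
        if rec is None or h == INF:
            break
        for pe in range(prev + 1, h + 1):
            out[pe] = rec
        prev = h
    return out


flags = [0, 0, 0, 1, 0, 1, 1, 1]   # flagged records of the RANK example
```

Applied to the RANK example, `rank(flags)` reproduces the ranks (-, -, -, 0, -, 1, 2, 3), and a distribute with the original positions as destinations undoes the corresponding concentrate.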

Our RAR algorithm is best described by
considering an example (Figure 1). We have N=8 PEs
and S(0:7)=(2, 6, 2, ∞, 5, 6, ∞, 6). (Recall
that S(i) specifies the PE from which the data
for PE(i) is to be fetched, and S(i)=∞ iff PE(i)
is to receive no data.) Let T(i)=i and
FLAG(i)=1, 0≤i<N. Our RAR algorithm begins by
sorting the records G(i) = (S(i),T(i),FLAG(i)).
Records are sorted on S; T is used to resolve ties
(i.e. records with the same S value are ordered by
their T value). The sorting algorithm we shall use
is a comparison sort. We require that during the
sort whenever a comparison between G(i) and G(j) is
made, if S(i)=S(j) and T(i)<T(j) then FLAG(i) is
set to zero. As a result of this, following the
sort, FLAG(i)=1 only for records with distinct S
values. For records with the same S value, FLAG=1
only for the record with highest T value. Lines
3-4 of Figure 1 give the result of the sort. The
S values with an asterisk above them correspond to
records with a FLAG of 1.

The next step is to rank the records with a
flag of 1. This results in the rank assignment of
line 5 (Figure 1). For PEs containing a record G
with FLAG=1, we may define a new record G' where
G'(i) = (R(i),U(i),S(i)), R(i) is the rank just
determined, U(i)=i and S(i) is as in line 4 of
Figure 1. The G'(i)s are concentrated to obtain
the configuration of lines 6 and 7. At this point,
we define a new record, G", for each PE containing
a G' type record. G"(i) = (S(i),V(i)) where V(i)=i.
The newly defined G" type records are distributed
according to S to get line 8. Observe that now a
PE contains a G" type record iff its data is to be
transmitted to another PE. Let D(i) be the data
in PE(i) that is to be broadcast. The T, U and V
registers of each PE contain return addresses that
will now be used to broadcast the data.

First, the data to be broadcast is concentrated
using the ranks contained in the V registers
(line 9). Next, the data is generalized using the
values in the U registers as the corresponding H
values in the definition of generalize. This
yields the configuration of line 10. Finally, the
broadcast data is sorted using the T value in each
PE as the sort key. The result (line 11) is that
data has been broadcast to all PEs requesting data.
It should be easy to see that the algorithm just
described provides a correct solution to the RAR
problem.
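The whole RAR procedure can be traced on the example S(0:7) = (2, 6, 2, ∞, 5, 6, ∞, 6) with a sequential Python sketch. It follows the sort, rank, concentrate, distribute, concentrate, generalize, sort sequence described above, but runs on one processor and makes no claim about unit-route counts; the function name `rar` and the use of `None` for a PE that receives no data are assumptions.

```python
INF = float("inf")


def rar(S, D):
    """Sequential trace of the RAR steps; returns the data received by each PE."""
    n = len(S)
    # SORT records (S, T) with T(i) = i; flag the last record of each run of equal S.
    recs = sorted((S[i], i) for i in range(n))
    flag = [recs[p][0] != INF and (p == n - 1 or recs[p + 1][0] != recs[p][0])
            for p in range(n)]
    # RANK + CONCENTRATE: G' = (U, S); the position in `conc` is the rank R.
    conc = [(p, recs[p][0]) for p in range(n) if flag[p]]
    # DISTRIBUTE G" = (S, V) according to S: PE(S) learns its rank V.
    V = [None] * n
    for r, (_, s) in enumerate(conc):
        V[s] = r
    # CONCENTRATE the data to be broadcast using V.
    data = [None] * n
    for pe in range(n):
        if V[pe] is not None:
            data[V[pe]] = D[pe]
    # GENERALIZE using U as the H values (with H(-1) taken as -1).
    gen = [None] * n
    prev = -1
    for r, (u, _) in enumerate(conc):
        for p in range(prev + 1, u + 1):
            gen[p] = data[r]
        prev = u
    # Final SORT on T returns each copy to the requesting PE.
    out = [None] * n
    for p in range(n):
        if recs[p][0] != INF:
            out[recs[p][1]] = gen[p]
    return out


S = [2, 6, 2, INF, 5, 6, INF, 6]
D = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]
result = rar(S, D)
```

In this trace PE(i) ends up holding D(S(i)) whenever S(i) is finite, and PEs 3 and 6 receive nothing.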

The RAW problem is simpler to handle than the
RAR problem. When all the W(i)s that are not
equal to ∞ are distinct, the RAW problem may be
solved by first sorting the broadcast data into
non-decreasing order of W(i). This sort is
followed by a distribute step in which the data being
broadcast is distributed according to the W values.
When the W(i)s are not distinct, the distribute
step will not be free of conflict. The conflict
is to be resolved in a manner that depends on what
is desired by the RAW. We consider two situations:

(i) If W(i_1)=W(i_2)=...=W(i_r)=i≠∞ then
PE(i) is to receive only the data from PE(j) where
j = min{i_k : 1≤k≤r}.

(ii) If W(i_1)=W(i_2)=...=W(i_r)=i≠∞ then
PE(i) is to receive data from all r PEs.

The first situation is handled by beginning
with records G(i) = (W(i),T(i),D(i),FLAG(i)),
0≤i<N, where T(i)=i and FLAG(i)=1. Records
are sorted by W(i) (using T(i) to break ties).
The sort is similar to the first SORT step of an
RAR except that FLAG(i) is set to zero if W(i)=W(j)
and T(i)>T(j). The sort is followed by a ranking
of the records. Define a new record
G'(i) = (W(i),D(i)) for each PE containing a record
with a FLAG of 1. The records G' are next
concentrated using the ranks just computed. Finally,
the D(i)s of the concentrated records are
distributed using the W fields. Thus, a RAW essentially
corresponds to lines 1 through 8 of Figure 1.


Situation (ii) of an RAW can be handled in a 
manner similar to situation (i). The details may 
be found in [8]. 
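Situation (i) of the RAW can be traced with a short sequential Python sketch along the same lines. It follows the sort, flag, concentrate and distribute sequence described above on a single processor; the function name `raw_min_source` and the `None` marker for a PE that receives nothing are assumptions of the sketch.

```python
INF = float("inf")


def raw_min_source(W, D):
    """RAW, situation (i): PE(w) receives D from the lowest-indexed PE writing to w."""
    n = len(W)
    # SORT records (W, T, D) on W with T(i) = i breaking ties.
    recs = sorted((W[i], i, D[i]) for i in range(n))
    # FLAG stays 1 only for the first record of each run of equal finite W.
    flag = [recs[p][0] != INF and (p == 0 or recs[p - 1][0] != recs[p][0])
            for p in range(n)]
    # RANK + CONCENTRATE the flagged records G' = (W, D).
    conc = [(recs[p][0], recs[p][2]) for p in range(n) if flag[p]]
    # DISTRIBUTE the data using the W fields.
    out = [None] * n
    for w, d in conc:
        out[w] = d
    return out


W = [3, INF, 3, 0]
D = ["a", "b", "c", "d"]
written = raw_min_source(W, D)
```

Here PE(0) and PE(2) both write to PE(3); the conflict is resolved in favor of PE(0), the lower index, while PE(1) writes nowhere.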
PACKET COMMUNICATION IN MULTISTAGE SHUFFLE-EXCHANGE NETWORKS


Daniel M. Dias and J. Robert Jump 
Department of Electrical Engineering 
Rice University 
Houston, TX 7700% 


Summary 


This paper summarizes research we have done
on asynchronous packet communication in buffered,
2^n input, Shuffle Exchange Networks with k
stages, 1≤k≤n, (denoted by SEN(2,n,k)) shown
in fig. 1. In this asynchronous packet
communication environment the networks can deadlock.
The research reported here considers deadlock
detection, recovery and the performance of these
networks.


The single stage SEN(2,n,1) [2,3,7] and the 
OMEGA network [4] (which, without "broadcast" at 
switches, is essentially an SEN(2,n,n) without 
the feedback links from stage n to stage 1 ) have 
been studied for their permutation capability. 
The performance of delta networks (which include 
the SEN(2,n,n) with the feedback links from 
stage n to stage 1 deleted) has been studied in 
426 | for a packet communication environment. 
Simulation results and bounds on network perfor- 
mance indicate that a range of performance can be 
obtained by varying the number of stages and size 
of buffers between stages of the network. 


The environment we consider is one in which
input packets, containing both the data to be
transferred and the address of the network output
link to which the data is to be passed, arrive
asynchronously at SEN input links. The SEN uses
different bits of this destination address to
direct a packet as it advances through the stages
of the network [1-7]. The operation of (2 x 2)
switches in the SEN is modelled essentially as
follows [1]. A fixed maximum queue length of
waiting packets is allowed between stages. Each
switch handles an input packet at each input link
simultaneously. It takes time "t_select" to
determine the successor node to which the packet
is to be sent. If that output is in use (i.e.
another input packet is in the process of being
passed to that output) the packet waits its turn
for the use of that output link (with equiprobable
selection of packets that request simultaneous passage
through the same switch output link). When the
selected output link becomes available, the switch
delays the data for another time interval "t_pass"
which represents gate delay. At this time the data is
available at output lines of the link. A wait
state is now entered (if necessary) until a
buffer in the selected output queue becomes
available.
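The use of destination address bits to steer a packet can be sketched in Python for one pass through n shuffle-exchange stages. This ignores buffering, timing and the recirculation needed when k<n; the function name `route` and the choice of consuming the destination bits most significant first are assumptions of the sketch.

```python
def route(src, dst, n):
    """Link positions visited by a packet over n shuffle-exchange stages."""
    mask = (1 << n) - 1
    pos, path = src, [src]
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (n - 1))) & mask
        # Exchange: the switch picks its upper or lower output from one address bit.
        bit = (dst >> (n - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


example = route(5, 2, 3)   # 3 stages, 8 links
```

After n stages every bit of the starting position has been rotated out and replaced by a destination bit, so the packet arrives at link `dst` regardless of where it entered.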


A packet incident on a switch is said to be
blocked if it encounters a full buffer at the
switch output link through which it must pass. A
deadlock is said to occur if a set of packets in
the SEN is permanently blocked. A necessary and
sufficient condition for a deadlock is the
occurrence of a cycle of blocked packets. It can
be shown that the SEN can recover from a deadlock
by advancing each packet in a blocked cycle by
one stage.
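The cycle condition can be illustrated with a small Python sketch that follows "waits for" links from a blocked packet. This is a centralized analogue for illustration only, not the test-packet schemes of the paper; the dictionary `waits_for`, which maps each blocked packet to the packet holding the buffer it needs (or `None` if that buffer will drain), is a hypothetical model.

```python
def find_blocked_cycle(waits_for, start):
    """Follow waits-for links from `start`; return the cycle reached, or None."""
    seen = {}              # packet -> order of first visit
    cur = start
    while cur is not None and cur not in seen:
        seen[cur] = len(seen)
        cur = waits_for.get(cur)
    if cur is None:
        return None        # the chain ends at a packet that can advance
    order = sorted(seen, key=seen.get)
    return order[seen[cur]:]   # packets from the first repeated one onward


waits_for = {"p1": "p2", "p2": "p3", "p3": "p1", "p4": "p1", "p5": None}
```

Packet `p4` is blocked behind the cycle p1, p2, p3 without being part of it, while `p5` is blocked only temporarily.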


Schemes for the detection of a deadlock have
been proposed. In these schemes, when a packet
is blocked, test packets are passed along the
blocked path to determine if a deadlock has
occurred. For an SEN(2,n,k) the longest possible
cycle length is L = (2^n x k). Suppose that it
takes time t_test for a test packet to pass
through a switch and suppose that a deadlock is
caused by a cycle of blocked packets of length m.
A deadlock detection scheme has been proposed
that does not require knowledge of t_test and
which takes time (t_test.(m + mL - m^2 - 1)) to
detect a deadlock. Another proposed scheme
depends on the knowledge of time t_test and takes
time (2.t_test.L) to detect a deadlock.
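The two detection times quoted above are easy to tabulate. The following Python sketch simply evaluates the stated expressions; the function names are hypothetical, and nothing about how the schemes operate is implied.

```python
def longest_cycle(n, k):
    """L = 2^n x k, the longest possible cycle in an SEN(2,n,k)."""
    return (2 ** n) * k


def detect_time_without_t_test(m, n, k, t_test):
    """Scheme that does not need to know t_test: t_test (m + mL - m^2 - 1)."""
    L = longest_cycle(n, k)
    return t_test * (m + m * L - m * m - 1)


def detect_time_with_t_test(n, k, t_test):
    """Scheme that relies on knowing t_test: 2 t_test L."""
    return 2 * t_test * longest_cycle(n, k)
```

For an SEN(2,5,1), where L = 32, a cycle of length m = 2 gives 61 t_test for the first scheme against 64 t_test for the second; for longer cycles the first expression grows with m while the second stays fixed.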


Each cycle of blocked packets must pass
through a switch in stage 1 of an SEN(2,n,k)
(fig. 1). Thus, for deadlock recovery, it is
sufficient to have an additional buffer at each
stage 1 switch, specifically for this purpose. It
then takes time (t_recovery.k) to recover from a
detected deadlock in an SEN(2,n,k), where
t_recovery is of the same order of magnitude as
t_pass. Alternatively, deadlock recovery can be
speeded up by having a "deadlock recovery buffer"
at each switch input and simultaneously advancing
all packets in a detected blocked cycle. The
deadlock recovery time, t_recovery, for this case
is a constant of the same order of magnitude as
t_pass.


Event driven simulations of SEN(2,n,k),
1≤k≤n, have been performed. These simulations
vary the number of input links (2^n),
stages (k), buffer lengths between stages,
t_select, t_pass, t_test and t_recovery.
Simulation results indicate the following:


(i) When single stage networks, with one buffer
between switches, are operated at very high
input rates, deadlocks occur very often
(approximately one deadlock for every 3
packets that enter the network).

(ii) The frequency of deadlock occurrence can be
dramatically reduced by

(a) increasing the number of stages in the
network (the SEN(2,n,n) is deadlock
free),

(b) increasing the buffer size between
stages,

(c) controlling the input rate to the
network.

(iii) A range of "maximum performance" can be
obtained by varying the number of stages in
the network. The upper limit of performance
of these networks is comparable to the same
size crossbar switch.


Typical simulation results for SEN(2,5,k),
1≤k≤5, are shown in fig. 2. Some of the
research in progress is as follows:

(i) The performance of the outlined schemes is
being compared with the synchronous
technique in [5].

(ii) The performance of the networks with other
input packet flow control strategies is
being studied. (An example is to restrict
the number of packets in the network from
each source.)

(iii) The multiplicity of paths from each network
input link to each network output link (as
opposed to the unique paths in delta
networks [4]) can be used to improve SEN
reliability. This aspect is being investigated.

Fig. 1 A 2^n input, k-stage Shuffle Exchange
Network (SEN(2,n,k)) for 1≤k≤n. PS denotes the
Perfect Shuffle permutation.

Notation:
Inter-arrival time: Average interarrival time of packets
(exponentially distributed) after a buffer at a network
input link becomes available.
Thruput: Average number of packets put out by the network
in unit time.
maxtp1: Maximum thruput of a network for a given
inter-arrival time.
maxtp2: Maximum thruput of a network at any inter-arrival
time.

Parameters:
t_select = 0, t_pass = 1.0, t_test = 0.1, t_recovery = [illegible].
m = length of deadlock cycle. L = 32k for an SEN(2,5,k).
Deadlock detection time = 0.1(m + mL - m^2 - 1).

Fig. 2 Typical simulation results for an SEN(2,5,k),
1≤k≤5.
Abstract—This paper describes a technique for producing a
VLSI layout of the shuffle-exchange graph. It is based on the
layout procedure in [2] which lays out a graph by bisecting the
graph, recursively laying out the two halves, and then combining
the two sublayouts. The area of the layout is related to the number
of edges that must be cut to bisect the graph.

For the shuffle-exchange graph on n vertices, we present a
bisection schema for which the above procedure yields an
O(n^2/lg n) area layout when n = 2^k and k is a power of two. The
bisection involves a mapping from vertices of the graph to
polynomials, and the polynomials are subsequently evaluated at
complex roots of unity. Incidental to this construction is a result on
the combinatorial problem of necklace enumeration.


1. Introduction 


The shuffle-exchange network has been shown to be an
important communications structure for parallel processors. Stone [8]
describes algorithms which use this structure to solve several
problems, including the computation of the discrete Fourier
transform and sorting bitonic sequences. The number of
communications steps required by these algorithms is typically a polynomial
in the logarithm of the number of nodes in the network, and the
nodes themselves need only perform relatively simple operations.

VLSI designers often try to minimize the area used by a circuit
subject to the requirements imposed by the fabrication technology
on the minimum feature sizes of the components [5]. In [9]
Thompson develops lower bounds on the growth of circuit area
based on graph-theoretic properties of the communications
structure. He shows in particular that any layout of the
shuffle-exchange network on n = 2^k vertices must use at least Ω(n^2/k^2)
area. The arguments for Thompson's lower bounds are based on
the minimum bisection width of a graph, which is the least number
of edges that must be removed to separate the graph into two
equal-sized subgraphs.

N00014-76-C-0370 and N00014-80-C-0236. Charles E. Leiserson

The concept of bisection width was extended by Lipton and
Tarjan [3] to that of a separator theorem for a class of graphs closed
under the subgraph relation. In essence, a separator theorem for a
class provides upper bounds on the bisection widths of graphs in
the class. Separator theorems allow the divide-and-conquer
paradigm to be exploited in the design of efficient algorithms for
graph manipulation [4]. Recently, Leiserson [2] has used this
approach to design area-efficient VLSI layouts.


In this paper a theorem similar to a separator theorem is proven
for the shuffle-exchange graph on $n = 2^k$ vertices. We exhibit a
dissection that shows how the shuffle-exchange graph may be
bisected, how the resultant subgraphs may themselves be bisected,
and so forth. We use this result to construct an $O(n^2/k)$ area layout
for the case when $k$ is a power of two, thereby improving
Thompson’s upper bound of $O(n^2/\sqrt{k})$. In our proof the vertices
of the shuffle-exchange graph are mapped to a polynomial space,
and then the polynomials are mapped to the complex plane. This
construction also provides an asymptotic result on the combina-
torial problem of necklace enumeration.


The next section formalizes the notions of bisection and 
dissection. Section 3 introduces the shuffle-exchange graph and 
describes its relationship to polynomials. In Section 4 we construct 
a bisection of the shuffle-exchange graph whose width is O(n/k), 
and in Section 5 we extend this result to produce a dissection. In 
Section 6, the layout algorithm of [2] is applied to this dissection to 
produce an $O(n^2/k)$ area layout for the shuffle-exchange graph.
Section 7 concludes by comparing this result with other work in the 
field. 


2. Graph Dissection 


In this section, we formalize concepts pertaining to the parti- 
tioning of a graph into smaller graphs by the removal of edges. 


A bisection $S$ of a graph $G = (V, E)$ into graphs $G' = (V', E')$
and $G'' = (V'', E'')$ is a disjoint partition of the vertices
$V = V' \cup V''$ together with a disjoint partition of the edges
$E = E' \cup E'' \cup E_S$ such that the cardinalities of $V'$ and $V''$ differ by
at most one. The cardinality of $E_S$ is called the width of the
bisection, and the edges in $E_S$ are said to be removed by the
bisection. The graphs $G'$ and $G''$ are called the halves of the
bisection.


Of course, any graph can be bisected by removing all its edges, 
but usually we are interested in removing as few edges as possible. 
The minimum bisection width of a graph is the smallest number of 
edges that must be removed to divide an n-vertex graph into a 
$\lceil n/2 \rceil$-vertex graph and a $\lfloor n/2 \rfloor$-vertex graph. Unfortunately, the
problem of finding the minimum bisection width of an arbitrary 
graph is NP-complete [1]. 
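
To make the definition concrete, the following minimal sketch computes the minimum bisection width of a very small graph by exhaustive search. It is only an illustration under stated assumptions: the function name `min_bisection_width` and the edge-list representation are hypothetical, and the search takes time exponential in the number of vertices, which is consistent with the hardness result cited above.

```python
from itertools import combinations

def min_bisection_width(n, edges):
    """Smallest number of edges cut by any split of vertices 0..n-1 into
    parts of sizes floor(n/2) and ceil(n/2), found by exhaustive search."""
    best = len(edges)
    for part in combinations(range(n), n // 2):
        side = set(part)
        # An edge is removed when its endpoints fall in different halves.
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        best = min(best, cut)
    return best

# Two illustrative 4-vertex graphs (not taken from the paper).
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
path4 = [(0, 1), (1, 2), (2, 3)]
print(min_bisection_width(4, cycle4), min_bisection_width(4, path4))
```

A cycle must be cut in two places to be halved, while a path needs only one cut.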


It is sometimes the case that every graph in a class of graphs can
be bisected by the same general mechanism. We define a separator
for a class $\mathcal{G}$ of graphs to be a family $\mathcal{S}$ of bisections such that
$\mathcal{S}$ contains a bisection of every nontrivial graph $G$ in $\mathcal{G}$. Interesting
separators are those that exhibit the closure property. A separator $\mathcal{S}$
for a class of graphs $\mathcal{G}$ has this property if for any graph $G \in \mathcal{G}$, the
halves $G'$ and $G''$ that are produced by a bisection of $G$ in $\mathcal{S}$ are also
in $\mathcal{G}$. Any separator with the closure property whose associated
class contains a particular graph $G$ is called a dissection of $G$.


A dissection $\mathcal{S}$ of $G$ may be thought of as a complete binary tree
that has $G$ at the root, the halves of $G$ from some bisection in $\mathcal{S}$ as its
sons, and the halves of the halves as grandsons, and so forth to
trivial graphs at the leaves. If $G$ has $n$ vertices, then the subgraphs
at level $j$ will have about $n/2^j$ vertices. Although there may be
other graphs in the class $\mathcal{G}$ associated with $\mathcal{S}$, at the very least $\mathcal{G}$
must contain all of the graphs in the tree.


In [3] Lipton and Tarjan introduce separator theorems which use
ideas similar to those presented here. In their work, however, the
discussion is restricted to classes of graphs that are closed under the
subgraph relation. (A class $\mathcal{G}$ is closed under the subgraph relation
if every subgraph $G'$ of a graph $G \in \mathcal{G}$ is also an element of $\mathcal{G}$.) We
have departed from their approach because the results of this paper
rely on properties of the shuffle-exchange graph that do not hold
for all of its subgraphs.


3. The Shuffle-Exchange Graph 


The shuffle-exchange graph on $n$ vertices is defined only when $n$
is a power of two. Each of the $n = 2^k$ vertices can be
identified with an element of the Cartesian product


$$\{0,1\}^k = \{\, b_{k-1} b_{k-2} \cdots b_0 \mid b_j \in \{0,1\} \,\}.$$


Each vertex $v \in \{0,1\}^k$ is incident on an exchange edge $(v, e(v))$ and
two shuffle edges $(v, \sigma(v))$ and $(v, \sigma^{-1}(v))$, where $e$ and $\sigma$ are
permutations defined by


$$e(b_{k-1} b_{k-2} \cdots b_1 b_0) = b_{k-1} b_{k-2} \cdots b_1 (1 - b_0), \qquad (1)$$


$$\sigma(b_{k-1} b_{k-2} \cdots b_1 b_0) = b_{k-2} b_{k-3} \cdots b_1 b_0 b_{k-1}. \qquad (2)$$


In the literature the vertices are usually identified with integers
from zero to $n-1$ represented in binary notation. The shuffle
permutation $\sigma$ is then the permutation applied to a deck of $n$ cards
by a perfect riffle shuffle, in which case $\sigma(m) \equiv 2m \pmod{n-1}$.
The exchange permutation $e$ is the permutation that exchanges
pairs of adjacent elements of the vertex set, so that $e(m) = m \pm 1$.
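
The integer view of the two permutations can be sketched in a few lines. This is a minimal illustration, assuming vertices are stored as Python integers with bit $j$ holding $b_j$; the function names `shuffle` and `exchange` are hypothetical.

```python
def shuffle(m, k):
    """Rotate the k-bit representation of m left by one position (sigma)."""
    n = 1 << k
    return ((m << 1) | (m >> (k - 1))) & (n - 1)

def exchange(m):
    """Flip the low-order bit of m (the exchange permutation e)."""
    return m ^ 1

k = 4
n = 1 << k
for m in range(n):
    # sigma(m) is congruent to 2m modulo n-1, and e moves m by exactly one.
    assert (shuffle(m, k) - 2 * m) % (n - 1) == 0
    assert abs(exchange(m) - m) == 1
print([shuffle(m, k) for m in range(n)])
```

The congruence is stated modulo $n-1$ because the all-ones vertex $n-1$ is fixed by the rotation.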


The shuffle-exchange graph is highly structured because of the
shuffle permutation. From equation (2) we see that $\sigma(v)$ can be
determined from $v$ by rotating the indices of $v$ to the left one
position. The shuffle permutation partitions $\{0,1\}^k$ into equiv-
alence classes known as necklaces [7], where two vertices are
equivalent whenever the indices of one are a cyclic permutation of
the indices of the other. Since rotation by $k$ positions yields the
original vertex, the cardinality of a necklace cannot exceed $k$.
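
The partition into necklaces can be computed directly as the orbits of the shuffle permutation. The sketch below is illustrative only; `necklaces` is a hypothetical helper, and vertices are again assumed to be $k$-bit integers.

```python
def shuffle(m, k):
    """Rotate the k-bit representation of m left by one position."""
    return ((m << 1) | (m >> (k - 1))) & ((1 << k) - 1)

def necklaces(k):
    """Partition {0, ..., 2^k - 1} into orbits of the shuffle permutation."""
    seen, result = set(), []
    for start in range(1 << k):
        if start in seen:
            continue
        orbit, v = [], start
        while v not in seen:
            seen.add(v)
            orbit.append(v)
            v = shuffle(v, k)
        result.append(orbit)
    return result

for orbit in necklaces(4):
    print(orbit)
```

Every orbit has at most $k$ vertices, and its size always divides $k$.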


The properties that we shall use to dissect the shuffle-exchange
graph are expressed conveniently in terms of the characteristic
polynomial, which is defined for a vertex $v = b_{k-1} \cdots b_0 \in \{0,1\}^k$
as


$$p_v(x) = \sum_{0 \le j \le k-1} b_j x^j. \qquad (3)$$


It should be apparent that $p_v(2)$ is precisely the vector $v$ considered
as a binary number, as discussed above. The following lemma 
shows the relationship between the characteristic polynomial and 
the shuffle and exchange permutations. 


Lemma 1: For all $v \in \{0,1\}^k$,

$$p_{e(v)}(x) = p_v(x) \pm 1, \qquad (4)$$

$$p_{\sigma(v)}(x) \equiv x\, p_v(x) \pmod{x^k - 1}, \qquad (5)$$


where the congruence (5) is taken over the ring $Z[x]$ of polynomials
with integer coefficients.


Proof. From the defining equations (1) and (2),


$$p_v(x) - p_{e(v)}(x) = b_0 x^0 - (1 - b_0) x^0 = 2 b_0 - 1,$$


$$x\, p_v(x) - p_{\sigma(v)}(x) = b_{k-1} x^k - b_{k-1} x^0 = b_{k-1} (x^k - 1).$$


The lemma follows from the fact that each $b_j$ is either zero or
one. □
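
Lemma 1 can also be checked mechanically for small $k$. The sketch below is an illustration under the assumption that a polynomial is stored as its coefficient list $[b_0, \ldots, b_{k-1}]$; the helper names are hypothetical. Multiplying by $x$ and reducing modulo $x^k - 1$ is then a cyclic shift of that list.

```python
def bits(v, k):
    """Coefficient list [b_0, ..., b_{k-1}] of p_v for the k-bit vertex v."""
    return [(v >> j) & 1 for j in range(k)]

def shuffle(v, k):
    """Rotate the k-bit representation of v left by one position."""
    return ((v << 1) | (v >> (k - 1))) & ((1 << k) - 1)

def times_x_mod(coeffs):
    """Multiply by x and reduce modulo x^k - 1, where k = len(coeffs)."""
    return [coeffs[-1]] + coeffs[:-1]

def check_lemma1(k):
    for v in range(1 << k):
        p = bits(v, k)
        pe = bits(v ^ 1, k)
        # (4): only the constant term changes, by +1 or -1.
        if pe[1:] != p[1:] or abs(pe[0] - p[0]) != 1:
            return False
        # (5): p_{sigma(v)} equals x * p_v reduced modulo x^k - 1.
        if bits(shuffle(v, k), k) != times_x_mod(p):
            return False
    return True

print(all(check_lemma1(k) for k in range(1, 9)))
```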


The cyclic structure of necklaces is exploited in Section 4 to
bisect the shuffle-exchange graph. This is done in such a way that
most of the necklaces in the graph are bisected. When the number
of vertices in a necklace is even, it turns out that the half-necklaces
also have a cyclic structure. An $m$-cycle is defined to be an ordered
sequence $(v_0, v_1, \ldots, v_{m-1})$ of $m$ distinct vertices such that for
$j = 1, \ldots, m-1$,


$$p_{v_j}(x) \equiv x\, p_{v_{j-1}}(x) \pmod{x^m - 1}. \qquad (6)$$


The next lemma provides justification for calling such a sequence
an $m$-cycle.


Lemma 2: Let $(v_0, \ldots, v_{m-1})$ be an $m$-cycle. Any sequence
$(v_j, \ldots, v_{m-1}, v_0, \ldots, v_{j-1})$ formed by cyclically permuting
$(v_0, \ldots, v_{m-1})$ is also an $m$-cycle. If $d$ is a divisor of $m$, then the
subsequence $(v_0, \ldots, v_{d-1})$ is a $d$-cycle.

Proof. This lemma can be proved by manipulating the congru-
ence (6) in the definition of an $m$-cycle. The congruence can be
iterated to yield


$$p_{v_{m-1}}(x) \equiv x^{m-1} p_{v_0}(x) \pmod{x^m - 1},$$


and since $x^m \equiv 1 \pmod{x^m - 1}$, it follows that


$$x\, p_{v_{m-1}}(x) \equiv p_{v_0}(x) \pmod{x^m - 1}.$$


Thus (6) holds between the first and last vertices as well as between
adjacent vertices, implying that the choice of a first vertex is
immaterial. To prove the second part of the lemma, observe that
congruence (6) modulo $x^m - 1$ must also hold modulo its divisor
$x^d - 1$. □


Congruence (5) shows that a necklace of $k$ vertices is a $k$-cycle.
Lemma 2 establishes that when $k$ is even, the necklace can be
bisected to yield two $k/2$-cycles.
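
As a small check of this claim, the sketch below tests congruence (6) by reducing coefficient lists modulo $x^m - 1$, which amounts to folding index $j$ onto $j \bmod m$. The helper names `reduce_mod` and `is_m_cycle` are hypothetical, and the necklace used is just one example for $k = 4$.

```python
def bits(v, k):
    """Coefficient list [b_0, ..., b_{k-1}] of p_v for the k-bit vertex v."""
    return [(v >> j) & 1 for j in range(k)]

def reduce_mod(coeffs, m):
    """Reduce a polynomial (low-order coefficient first) modulo x^m - 1."""
    out = [0] * m
    for j, c in enumerate(coeffs):
        out[j % m] += c
    return out

def is_m_cycle(seq, k, m):
    """Check that seq has m distinct vertices satisfying congruence (6)."""
    if len(seq) != m or len(set(seq)) != m:
        return False
    for prev, cur in zip(seq, seq[1:]):
        lhs = reduce_mod(bits(cur, k), m)
        rhs = reduce_mod([0] + bits(prev, k), m)  # x * p_prev
        if lhs != rhs:
            return False
    return True

necklace = [1, 2, 4, 8]  # orbit of 0001 under the shuffle for k = 4
print(is_m_cycle(necklace, 4, 4), is_m_cycle(necklace[:2], 4, 2),
      is_m_cycle(necklace[2:], 4, 2))
```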


4. Bisecting the Shuffle-Exchange Graph 


The concepts developed in Section 3 are applied in this section
to construct a bisection of the shuffle-exchange graph on $n = 2^k$
vertices. The construction is obtained by evaluating the charac-
teristic polynomials of the vertices at a complex $k$th root of unity,
inducing a mapping from $\{0,1\}^k$ to the complex plane. The
complex plane is then divided to induce a bisection of the shuffle-
exchange graph. A corollary of this construction is an asymptotic
result on the number of necklaces.


Let $\omega = e^{2\pi i/k}$ be the principal primitive complex $k$th root of
unity, and consider the mapping $v \mapsto p_v(\omega)$ from $\{0,1\}^k$ to the
complex plane. Figure 1 graphs the values of $p_v(\omega)$ for $k = 5$. The
vertices are labeled with $p_v(2)$. The solid lines forming pentagons
concentric about the origin represent shuffle edges, and the
horizontal dotted arcs represent exchange edges.


Let us examine this figure in relation to Lemma 1. The
occurrence of regular $k$-gons of shuffle edges can be explained by
congruence (5). Since $\omega$ is a root of $x^k - 1$, this congruence becomes
the equality $p_{\sigma(v)}(\omega) = \omega\, p_v(\omega)$. Thus $p_{\sigma(v)}(\omega)$ is the point obtained
from $p_v(\omega)$ by a counterclockwise rotation of $2\pi/k$ radians about
the origin. The vertices in a necklace are mapped to $k$ points
equally spaced on a circle about the origin, unless the entire
necklace is mapped to the origin. The fact that exchange edges are
horizontal can be explained by equation (4) in Lemma 1. If vertices
$v$ and $e(v)$ are incident on an exchange edge, then they are mapped
to complex numbers that have the same imaginary part and differ
by one in the real part.
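
Both geometric facts are easy to confirm numerically. The sketch below evaluates $p_v(\omega)$ in floating point and checks the rotation and the horizontal displacement up to a tolerance; the names `p_at` and `check_geometry` and the tolerance value are assumptions made for illustration.

```python
import cmath

def shuffle(v, k):
    """Rotate the k-bit representation of v left by one position."""
    return ((v << 1) | (v >> (k - 1))) & ((1 << k) - 1)

def p_at(v, k, w):
    """Evaluate the characteristic polynomial of the k-bit vertex v at w."""
    return sum((w ** j for j in range(k) if (v >> j) & 1), 0j)

def check_geometry(k, tol=1e-9):
    w = cmath.exp(2j * cmath.pi / k)
    for v in range(1 << k):
        z = p_at(v, k, w)
        # Shuffle edge: rotation by 2*pi/k about the origin.
        if abs(p_at(shuffle(v, k), k, w) - w * z) > tol:
            return False
        # Exchange edge: same imaginary part, real parts differ by one.
        d = p_at(v ^ 1, k, w) - z
        if abs(d.imag) > tol or abs(abs(d.real) - 1) > tol:
            return False
    return True

print(check_geometry(5))
```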


The bisection of the shuffle-exchange graph will be achieved by
partitioning the vertices based on the imaginary part of $p_v(\omega)$, with
tie-breaking when $p_v(\omega)$ is real. All edges that cross the real line
will be removed, and it will be shown that there are at most $O(n/k)$
of these. This bound is easily shown for edges whose incident
vertices are not involved in the tie-breaking. Since there are $n$
vertices in the shuffle-exchange graph, there are at most $n/k$
regular $k$-gons of shuffle edges, and each of these $k$-gons crosses
the real line twice. Since exchange edges are horizontal, they never
cross the real line.


Figure 1: The shuffle-exchange graph on $32 = 2^5$ vertices
mapped to the complex plane by $v \mapsto p_v(\omega)$. Vertices are
labelled with $p_v(2)$. Dotted lines represent exchange edges,
and solid lines represent shuffle edges.


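In order to define the bisection formally, we first partition the
nonzero complex numbers as $C^+ \cup C^-$ where


$$C^+ = \{ z \in C \mid \mathrm{Im}(z) > 0 \} \cup \{ x \in R \mid x > 0 \},$$

$$C^- = \{ z \in C \mid \mathrm{Im}(z) < 0 \} \cup \{ x \in R \mid x < 0 \}.$$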


The halves $G'$ and $G''$ are defined by the regions to which vertices
of the shuffle-exchange graph are mapped. The vertices for which
$p_v(\omega) \in C^+$ are assigned to $V'$ and those for which $p_v(\omega) \in C^-$ are
assigned to $V''$. The remaining vertices, those for which $p_v(\omega) = 0$,
are distributed arbitrarily but equally between $V'$ and $V''$. Three
types of edges are placed in $E_S$.


1. Exchange edges whose incident vertices are mapped to 
real numbers. 

2. Shuffle edges whose incident vertices are mapped to the 
origin. 

3. Shuffle edges between vertices $v$ and $v'$ such that
$p_v(\omega) \in C^+$ and $p_{v'}(\omega) \in C^-$.


It can be seen by inspection that $E_S$ is a superset of the set of edges
that connect $V'$ to $V''$. Edges not in $E_S$ are allocated to $E'$ or $E''$
according as their incident vertices are in $V'$ or $V''$.


To see that $|V'| = |V''|$, consider for any vertex $v$ the vertex $C(v)$
obtained by complementing every index in the vector $v$. This
relationship can be restated in terms of characteristic polynomials
as


$$p_{C(v)}(x) = (x^{k-1} + x^{k-2} + \cdots + x + 1) - p_v(x).$$


Because the sum of all $k$th roots of unity is zero, it follows that
$p_v(\omega) = -p_{C(v)}(\omega)$. Therefore, the correspondence $v \mapsto C(v)$ is a
one-to-one correspondence between the vertices mapped to $C^+$ and
those mapped to $C^-$. This proves that this partition is a bisection as
was claimed. The cardinality of $E_S$ is the width of the bisection and
is bounded by the following theorem.


Theorem 3: For any positive integer $k$, there is a bisection $S$ of
the shuffle-exchange graph on $n = 2^k$ vertices such that the width of
$S$ is at most $6(n/k)$.


Proof. Let $S$ be the bisection described above, and consider the
three types of edges that compose $E_S$. We will bound each of the
three types by the quantity $2(n/k)$.


Each of the type 3 edges is a shuffle edge incident on vertices 
mapped to nonzero complex numbers, and each such vertex 
belongs to a necklace of exactly k vertices which are mapped to 
nonzero numbers. Since the total number of vertices in the shuffle- 
exchange graph is n, there can be at most n/k such necklaces. The 
shuffle edges in each of these necklaces form a regular k-gon 
centered at the origin, and thus only two of these edges can cross 
the real line, in the sense of having one incident vertex mapped to 
$C^+$ and the other to $C^-$. Thus there can be at most 2(n/k) type 3
edges. 


The same argument can be used to bound the number of type 1 
edges. There are at most 2(n/k) vertices mapped to nonzero real 
numbers. Since every exchange edge whose incident vertices are 
mapped to real numbers has at least one of these vertices mapped 
to a nonzero real number, there can be no more than 2(n/k) type 1 
edges. 


Finally, the number of type 2 edges can be bounded by the
number of type 1 edges by observing that for each shuffle edge
$(v, \sigma(v))$ whose incident vertices are mapped to the origin, the
exchange edge $(\sigma(v), e(\sigma(v)))$ is a type 1 edge. □
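
The counting argument can be replayed numerically for small $k$. The following is a rough sketch under several assumptions: points are classified in floating point with a tolerance, shuffle edges are counted once per vertex as $(v, \sigma(v))$, $k \ge 2$, and the name `removed_edge_counts` is hypothetical. It returns the number of edges of each type together with the number of vertices mapped to $C^+$ and $C^-$.

```python
import cmath

def shuffle(v, k):
    """Rotate the k-bit representation of v left by one position."""
    return ((v << 1) | (v >> (k - 1))) & ((1 << k) - 1)

def removed_edge_counts(k, tol=1e-9):
    """Return (type1, type2, type3, |C+|, |C-|) for n = 2^k, k >= 2."""
    n = 1 << k
    w = cmath.exp(2j * cmath.pi / k)
    z = [sum((w ** j for j in range(k) if (v >> j) & 1), 0j) for v in range(n)]

    def region(c):
        if abs(c) <= tol:
            return 0                      # mapped to the origin
        if abs(c.imag) <= tol:
            return 1 if c.real > 0 else -1  # tie-break on the real line
        return 1 if c.imag > 0 else -1

    reg = [region(c) for c in z]
    is_real = [abs(c.imag) <= tol for c in z]
    # Type 1: exchange edges (v, v+1), v even, with endpoints on the real line.
    type1 = sum(1 for v in range(0, n, 2) if is_real[v])
    # Type 2: shuffle edges whose endpoints are mapped to the origin.
    type2 = sum(1 for v in range(n) if reg[v] == 0 and reg[shuffle(v, k)] == 0)
    # Type 3: shuffle edges joining C+ and C-.
    type3 = sum(1 for v in range(n) if reg[v] * reg[shuffle(v, k)] == -1)
    return type1, type2, type3, reg.count(1), reg.count(-1)

for k in (2, 3, 4, 5, 8):
    t1, t2, t3, plus, minus = removed_edge_counts(k)
    print(k, t1 + t2 + t3, 6 * 2 ** k / k, plus == minus)
```

For each $k$ tried, the total stays below $6n/k$ and the two regions receive the same number of vertices.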


We now pause to examine an interesting by-product of these 
counting arguments, a result on the combinatorial problem of 
necklace enumeration. A necklace is a string of k pearls, where 
each pearl may be one of c colors. Two necklaces are considered 
equivalent if one can be rotated to form the other, but not if they 
are only reflections. It is well-known [7] that the number of 
necklaces of k pearls in c colors is 


$$\frac{1}{k} \sum_{d \mid k} c^{k/d}\, \phi(d). \qquad (7)$$

In this formula $\phi(d)$ is Euler’s totient function, the number of
positive integers not exceeding $d$ that are relatively prime to $d$.
Although it appears that the term for $d = 1$ in (7) might dominate
the summation, it is not apparent that the contribution of the other
terms is insignificant. However, the following corollary to Theorem
3 shows that this term is asymptotically dominant.


Corollary: The number of necklaces of $\{0, 1, \ldots, c-1\}^k$ lies
between $c^k/k$ and $((c+1)/(c-1))\,(c^k/k)$.


Proof. The definitions of the $\sigma$ and $e$ permutations may be
extended to $\{0, 1, \ldots, c-1\}^k$ as follows.

$$e(b_{k-1} b_{k-2} \cdots b_1 b_0) = b_{k-1} b_{k-2} \cdots b_1\, (b_0 + 1 \bmod c),$$


$$\sigma(b_{k-1} b_{k-2} \cdots b_1 b_0) = b_{k-2} b_{k-3} \cdots b_1 b_0 b_{k-1}.$$


The characteristic polynomial is defined as before (notice that now
$p_v(c)$ is the vector $v$ considered as a number expressed in base $c$
notation), and the argument of Theorem 3 can be adapted to show
that the function $v \mapsto p_v(\omega)$ maps at most $2c^k/((c-1)k)$ elements of
$\{0, 1, \ldots, c-1\}^k$ to zero and that the remainder lie in necklaces of
$k$ elements. □
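
The corollary can be compared against formula (7) for small parameters. The sketch below uses exact integer arithmetic so that the two bounds are tested without rounding; the function names are hypothetical, and the totient is computed naively, which is adequate only for small $d$.

```python
from math import gcd

def totient(d):
    """Euler's totient: count of 1 <= i <= d with gcd(i, d) = 1."""
    return sum(1 for i in range(1, d + 1) if gcd(i, d) == 1)

def necklace_count(k, c):
    """Formula (7): (1/k) * sum over divisors d of k of c^(k/d) * phi(d)."""
    total = sum(c ** (k // d) * totient(d)
                for d in range(1, k + 1) if k % d == 0)
    return total // k

def within_corollary_bounds(k, c):
    """Exact check of c^k/k <= N <= ((c+1)/(c-1)) * c^k/k."""
    N = necklace_count(k, c)
    return c ** k <= N * k and N * k * (c - 1) <= (c + 1) * c ** k

print(necklace_count(5, 2), within_corollary_bounds(5, 2))
```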


5. Dissecting the Shuffle-Exchange Graph 


In the previous section, we presented a bisection of the shuffle-
exchange graph on $n = 2^k$ vertices. In this section we will show that
when $k$ is even, the structure of the halves is similar to the structure
of the original shuffle-exchange graph. This similarity is captured
in the notion of an $m$-cyclic subgraph of the shuffle-exchange
graph, and it is shown that the halves are $k/2$-cyclic subgraphs.
The bisection from Theorem 3 can be modified to bisect $m$-cyclic
subgraphs when $m$ is even. Thus when $k$ is a power of two, this
approach can be used iteratively to construct a complete dissection
of the shuffle-exchange graph.


An $m$-cyclic subgraph is a subgraph of the shuffle-exchange
graph whose vertices are partitioned into disjoint $m$-cycles. Vertices
not appearing in these $m$-cycles are also allowed, but such vertices
must be isolated, not incident on any edge in the subgraph. If a
shuffle edge $(v, \sigma(v))$ appears as an edge of the $m$-cyclic subgraph, it
must occur between adjacent vertices of one of the $m$-cycles, and
the exchange edge $(\sigma(v), e(\sigma(v)))$ must be an edge of the $m$-cyclic
subgraph as well.


The reader should be warned that $m$-cyclic subgraphs are
nothing more than a vehicle for extending the bisection of the
shuffle-exchange graph to a dissection. The definition has been
carefully crafted so that the proof of Theorem 3 will apply to them
and so that their separator exhibits the closure property.


Lemma 4: When k is even, the halves G’ and G” produced by 
the bisection from Theorem 3 are k/2-cyclic subgraphs. 


Proof. Without loss of generality, we show this for $G'$ only.
The vertices that are mapped to zero by $v \mapsto p_v(\omega)$ have no incident
edges (are isolated), but every other vertex of $G'$ occurs in some
sequence $(v_0, \ldots, v_{k/2-1})$ that arose from cutting a necklace of $k$
vertices in half. Since any necklace of $k$ vertices is a $k$-cycle, and
$k/2$ divides $k$, Lemma 2 ensures that this sequence is a $k/2$-cycle.
Thus we have demonstrated the first requirement for $G'$ to be a
$k/2$-cyclic subgraph: every vertex not in an $m$-cycle is isolated.


We must now show that if a shuffle edge $(v, \sigma(v))$ appears as an
edge in $G'$, then it occurs between adjacent vertices of one of the
$m$-cycles, and furthermore, that then the exchange edge
$(\sigma(v), e(\sigma(v)))$ is also in $G'$. It is clear that the first condition is
satisfied. The second condition can be demonstrated by observing
that both $v$ and $\sigma(v)$ are mapped to $C^+$. Since the point $p_{\sigma(v)}(\omega)$ can
be obtained from $p_v(\omega)$ by a counterclockwise rotation of $2\pi/k < \pi$
radians about the origin, it is impossible for $\sigma(v)$ to be mapped to
the real line. The set of removed edges $E_S$ contains only those
exchange edges whose incident vertices are mapped to real points,
which means that $(\sigma(v), e(\sigma(v)))$ must be in $E'$. □


When $m$ is even, the bisection from Theorem 3 can be
generalized to a bisection of an arbitrary $m$-cyclic subgraph. Let
$\omega_m = e^{2\pi i/m}$ and consider the function $v \mapsto p_v(\omega_m)$. Since $\omega_m$ is a
root of $x^m - 1$, the congruence (6) between adjacent vertices of
$m$-cycles becomes the equality $p_{v_j}(\omega_m) = \omega_m\, p_{v_{j-1}}(\omega_m)$. This means
that if any vertex of an $m$-cycle is mapped to a nonzero complex
number, all the $m$ vertices of the $m$-cycle are mapped to distinct
points evenly spaced on a circle about the origin. Equation (4)
applies as before to show that vertices connected by an exchange
edge are mapped to complex numbers which differ by one.


Let $G$ be an arbitrary $m$-cyclic subgraph of a shuffle-exchange
graph on $n = 2^k$ vertices, and suppose that $m$ is even. In order to
construct a bisection of $G$, the vertices of the $m$-cycles of $G$ are
assigned to $V'$ or $V''$ according as they are mapped by $v \mapsto p_v(\omega_m)$
to $C^+$ or $C^-$. The remaining vertices of $G$ are those vertices that are
mapped to the origin and those that are isolated. These may be
divided arbitrarily but equally between $V'$ and $V''$. As with the
bisection from Theorem 3, $E_S$ consists of three types of edges.


1. Exchange edges whose incident vertices are mapped to 
real numbers. 

2. Shuffle edges whose incident vertices are mapped to the 
origin. 

3. Shuffle edges between vertices $v$ and $v'$ such that
$p_v(\omega_m) \in C^+$ and $p_{v'}(\omega_m) \in C^-$.


The remaining edges are assigned to $E'$ or $E''$ depending on
whether their incident vertices are in $V'$ or $V''$.


Unlike before, however, the correspondence $v \mapsto C(v)$ cannot be
used to show that $|V'| = |V''|$, since $v$ may be a vertex of $G$ when
$C(v)$ is not. But because $m$ is even, the equality
$p_{v_j}(\omega_m) = -p_{v_{j+m/2}}(\omega_m)$ holds for vertices $v_j$ and $v_{j+m/2}$ in the same
$m$-cycle, and the correspondence $v_j \mapsto v_{j+m/2}$ suffices to show that
this partition is a bisection. The following lemma provides a bound
for the width of the bisection.

Lemma 5: Let $m$ be even, and let $G$ be an $m$-cyclic subgraph on
$\ell$ vertices. There is a bisection $S$ that bisects $G$ into $m/2$-cyclic
subgraphs and has width at most $6\ell/m$.


Proof. Let $S$ be the bisection just described. Its width can be
bounded by showing that there are at most $2\ell/m$ of each of the
three types of edges in $E_S$. This bound holds for type 3 edges
because there can be at most $\ell/m$ disjoint $m$-cycles in $G$ and no
more than two type 3 edges per $m$-cycle. Since each type 1 edge has
at least one incident vertex mapped to a nonzero real number, and
there are at most two such vertices per $m$-cycle, the bound holds for
these edges. Finally, for any type 2 edge $(v, \sigma(v))$, the edge
$(\sigma(v), e(\sigma(v)))$ is a type 1 edge because $G$ is an $m$-cyclic subgraph.
Thus there can be no more type 2 edges than type 1 edges, and the
bound on the width of the bisection is proved. It should be
remarked here that the definition of $m$-cyclic subgraphs was
specifically constructed in order to establish this correspondence
between type 1 and type 2 edges.


To prove that the halves of the bisection are $m/2$-cyclic
subgraphs, observe that the bisection $S$ isolates those vertices that
are in $m$-cycles mapped to the origin, and splits the other $m$-cycles
into pairs of $m/2$-cycles. Since shuffle edges appear only between
adjacent vertices of $m$-cycles, this adjacency is preserved in the
$m/2$-cycles. The only exchange edges removed by the bisection are
those whose incident vertices are mapped to real numbers, and
hence the argument of Lemma 4 can be used to show that if
$(v, \sigma(v))$ is in one of the halves, then $(\sigma(v), e(\sigma(v)))$ is also in the
half. □


We are now ready to combine this bisection with the bisection 
from Theorem 3 into a dissection of the shuffle-exchange graph on 
$n = 2^k$ vertices for the case when $k$ is a power of two. Recall from
Section 2 that to dissect this graph, we need to find a class of 
subgraphs that has a separator with the closure property. The next 
theorem provides such a class. 


Theorem 6: If $k$ is a power of two, then there is a dissection $\mathcal{S}_n$ of
the shuffle-exchange graph on $n = 2^k$ vertices such that any
bisection in $\mathcal{S}_n$ which bisects an $m$-vertex graph has width at most


$$f_n(m) = \begin{cases} 6n/k & \text{if } m > n/k, \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$


Proof. Let $\mathcal{G}_n$ be the class of subgraphs consisting of i) the
shuffle-exchange graph itself, ii) its $k/2^j$-cyclic subgraphs that have
$n/2^j$ vertices, for $j = 1, \ldots, (\lg k) - 1$, and iii) its subgraphs that
have no edges. Correspondingly, the separator $\mathcal{S}_n$ consists of i) the
bisection of the shuffle-exchange graph from Theorem 3, ii) the
bisections of its $k/2^j$-cyclic subgraphs from Lemma 5, and iii)
arbitrary bisections of the totally disconnected subgraphs. To see
that the closure property holds for $\mathcal{S}_n$, we first observe that the
halves of the shuffle-exchange graph are $k/2$-cyclic subgraphs with
$n/2$ vertices. For $j = 1, \ldots, (\lg k) - 2$, the halves of the $k/2^j$-cyclic
subgraphs with $n/2^j$ vertices are $k/2^{j+1}$-cyclic subgraphs with
$n/2^{j+1}$ vertices. When $j = (\lg k) - 1$ the bisection from Lemma 5
uses the mapping $v \mapsto p_v(\omega_2)$ to bisect 2-cyclic subgraphs. Since
$\omega_2 = -1$, all vertices are mapped to real numbers, and thus the
halves consist entirely of isolated vertices.


The bisection of the shuffle-exchange graph from Theorem 3
has width $6(n/k)$. For $j = 1, \ldots, (\lg k) - 1$, the bisection from
Lemma 5 bisects a $k/2^j$-cyclic subgraph of $n/2^j$ vertices with width
$6(n/2^j)/(k/2^j) = 6(n/k)$. The totally disconnected graphs can be
bisected with zero width. □


6. Laying Out the Shuffle-Exchange Graph 


Given a dissection of an arbitrary graph, the divide-and-conquer 
technique of [2] can produce a VLSI layout whose area is related to 
the bisection widths of the graphs in the dissection. The VLSI 
model used is that of [9], and its important attributes are that wires 
have a minimum width and that only a constant number may cross 
at a point. In this section the results of Section 5 are applied to 
produce an $O(n^2/\lg n)$ area VLSI layout for an $n$-vertex shuffle-
exchange network. 


The technique of [2] constructs a layout for a general graph G by 
first bisecting G and laying out the halves recursively. The halves 
are then placed side-by-side, and the edges that were removed to 
bisect G are routed between the halves. The layout area can
therefore be described as a recurrence in the area of the halves and 
the area required to route the edges removed by the bisection. This 
latter quantity is a function of the bisection widths in the dissection 
of G because the length and width of the layout increase by a 
constant amount for each edge routed. 


The particulars of how the area recurrence arises from this 
construction are described more fully in [2]. Some solutions to the 
recurrence are also given in that paper, but the bisection width 
bound $f_n(m)$ from equation (8) fails to satisfy certain conditions that
are assumed for those solutions. Therefore, we give the area 
recurrence from [2] without further justification, but present its 
solution in detail. 


Let $A_n(m)$ be the area of the layout of an $m$-vertex graph in the
dissection of Theorem 6 (thus $A_n(n)$ is the area of the original
shuffle-exchange graph). We express $A_n(m)$ in terms of $f_n(m)$ from
equation (8). For the initial condition of the area recurrence, $A_n(1)$
is a constant, and for $1 < m \le n$,


$$A_n(m) = \left[ \sqrt{2 A_n(m/2)} + f_n(m) \right]^2. \qquad (9)$$


The recurrence can be solved by taking the square root of both
sides and then substituting $L_n(m)$ for $\sqrt{A_n(m)}$. For $1 < m \le n$ this
yields


$$L_n(m) = \sqrt{2}\, L_n(m/2) + f_n(m).$$


Iterating this recurrence and recalling that $n = 2^k$, we have

$$L_n(n) = f_n(n) + \sqrt{2}\, f_n(n/2) + 2 f_n(n/4) + \cdots + (\sqrt{2})^{k-1} f_n(2) + \sqrt{n}\, L_n(1)$$

$$\le (6n/k) \left[ 1 + \sqrt{2} + \cdots + (\sqrt{2})^{\lg k} \right] + \sqrt{n}\, L_n(1) \qquad (10)$$

$$= (6n/k) \left[ \frac{(\sqrt{2})^{\lg k + 1} - 1}{\sqrt{2} - 1} \right] + \sqrt{n}\, L_n(1)$$

$$= O((n/k) \sqrt{k})$$

$$= O(n/\sqrt{k}).$$


The reason the sum of the powers of $\sqrt{2}$ goes only as far as $\lg k$ in
line (10) is that $f_n(m)$ is zero after this point. Since $A_n(n)$ is the
square of $L_n(n)$, the area of the layout is $O(n^2/k)$.
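
The solution of the recurrence can be checked numerically. The sketch below iterates $L_n(m) = \sqrt{2}\, L_n(m/2) + f_n(m)$ directly and compares the result with the closed-form expression following line (10). It is an illustration only: the initial value $L_n(1) = 1$ is an arbitrary assumption, and the function names are hypothetical.

```python
from math import sqrt, log2

def f(n, k, m):
    """Width bound (8): 6n/k while m > n/k, and zero afterwards."""
    return 6 * n / k if m * k > n else 0.0

def side_length(n, k, m, base=1.0):
    """L_n(m) from L_n(m) = sqrt(2) * L_n(m/2) + f_n(m), with L_n(1) = base."""
    if m == 1:
        return base
    return sqrt(2) * side_length(n, k, m // 2, base) + f(n, k, m)

def closed_form_bound(n, k, base=1.0):
    """Right-hand side of the line after (10), for k a power of two."""
    r = sqrt(2)
    return (6 * n / k) * (r ** (log2(k) + 1) - 1) / (r - 1) + sqrt(n) * base

for k in (2, 4, 8, 16):
    n = 2 ** k
    L = side_length(n, k, n)
    print(k, L, closed_form_bound(n, k), L * L * k / n ** 2)
```

The last printed column is $A_n(n)\,k/n^2$, which should remain bounded as $k$ grows if the area is $O(n^2/k)$.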


This technique has been used in Figure 2 to lay out a shuffle- 
exchange network on 256 vertices. Only one fourth of the layout is 
shown, and the dissection that was used differs slightly from the 
one in Section 5. Instead of removing exchange edges, the arbitrary 
divisions among vertices mapped to zero are chosen so that $e(v)$ is
in the same component as v, and the two are placed together. 


7. Conclusion 


We have developed an extraordinary amount of machinery in
order to construct an $O(n^2/k)$ area layout for the shuffle-exchange
graph on $n = 2^k$ vertices, and indeed, we have only been able to
show this upper bound for the case when $k$ is a power of two. It
may be that this bound holds when $k$ is not a power of two, but we
have not been able to prove this. For the time being, the best
general upper bound seems to be Thompson’s $O(n^2/\sqrt{k})$ bound.


In any event, a gap remains between either of these upper
bounds and the best known lower bound of $\Omega(n^2/k^2)$, which is also
given by Thompson. This lower bound is proved in [9] by showing
that the minimum bisection width of the shuffle-exchange graph
must be $\Omega(n/k)$ and that the area of any graph layout must be at
least the square of the minimum bisection width of the graph.
Theorem 3 shows that this $\Omega(n/k)$ lower bound for bisection of the
shuffle-exchange graph can be achieved, even though the dissection
based on this bisection does not achieve the $\Omega(n^2/k^2)$ lower bound
for layout area. This is because the bisection width $f_n(m)$ does not
immediately decrease as $m$ decreases from $n$. It may be that an
improved lower bound for the layout area will be based on the
notion of a minimum dissection, where the width of every bisection
in any dissection can be bounded from below.


On the other hand, it may be that an $O(n^2/k^2)$ area layout does
exist for the shuffle-exchange graph, as does one for the cube-
connected-cycles (CCC) network of Preparata and Vuillemin [6].
The CCC is the graph that arises from a boolean hypercube of $d$
dimensions when each vertex is replaced by a cycle of $d$ vertices.
Many of the problems that can be solved quickly using the shuffle-
exchange interconnection can also be solved quickly using the
CCC. But despite the fact that a smaller layout is known for the
CCC, descriptions of algorithms for the CCC tend to be more
complicated. The discovery of an $O(n^2/k^2)$ area layout for the
shuffle-exchange graph would therefore favor the shuffle-exchange
graph as the network of choice and would allow the many
algorithms already designed for this network to be applied directly
in optimal VLSI implementations. But until such a layout is found
(if ever one is found) the CCC will continue to have the edge.

In conclusion, we believe that characteristic polynomials provide
a useful way of viewing the shuffle-exchange network, and we 
believe that this approach goes beyond the particular technical 
results presented here. Characteristic polynomials unveil proper- 
ties of the shuffle-exchange graph that are obscured by the classical 
approach of relating the vertices to integers. We hope that the 
mechanisms we have developed to relate the topology of a 
particular graph to the algebra of polynomials will be exploited 
further. 

Figure 2: One fourth of a shuffle-exchange network


References


[1] M.R. Garey, D.S. Johnson, and L. Stockmeyer, “Some
simplified polynomial complete problems,” 6th Annual Sym-
posium on Theory of Computing, ACM, (April, 1974), pp.
47-63.

[2] C.E. Leiserson, “Area-efficient graph layouts (for VLSI),”
21st Annual Symposium on Foundations of Computer Science,
IEEE Computer Society, (October, 1980).

[3] R.J. Lipton and R.E. Tarjan, “A separator theorem for
planar graphs,” A Conference on Theoretical Computer Sci-
ence, University of Waterloo, (August, 1977).

[4] R.J. Lipton and R.E. Tarjan, “Applications of a planar
separator theorem,” 18th Annual Symposium on Foundations
of Computer Science, IEEE Computer Society, (October,
1977), pp. 162-170.

[5] C.A. Mead and L.A. Conway, Introduction to VLSI Sys-
tems, Addison-Wesley, (1980).

[6] F.P. Preparata and J. Vuillemin, The cube-connected-cycles:
a versatile network for parallel computation, Technical Report
356, Institut de Recherche d’Informatique et d’Automatique,
(June, 1979).

[7] J. Riordan, An Introduction to Combinatorial Analysis, John
Wiley & Sons, Inc., (1958).

[8] H.S. Stone, “Parallel processing with the perfect shuffle,”
IEEE Transactions on Computers, C-20, 2, (February, 1971),
pp. 153-161.

[9] C.D. Thompson, A Complexity Theory for VLSI, Ph.D.
Thesis, Carnegie-Mellon University Computer Science De-
partment, (1980).


TOWARD A GENERALIZATION OF TWO AND THREE-PASS 
MULTISTAGE, BLOCKING INTERCONNECTION NETWORKS 
Rehovot, Israel


Abstract 

Blocking, multistage networks can realize only a 
fraction of the N! permutations (interconnections)
possible. The minimum number of passes required 
to perform arbitrary permutations is an important 
parameter of every network. 


We define four distinct classes of networks capa- 
ble of performing any permutation in two passes, 
the lowest limit possible. These classes stem 
from a generalization of the Baseline network 
(known to be a two-pass network) by three of its 
main properties. Two of the classes are shown to 
be populated and examples of each are given. Whe- 
ther the other two classes are empty is not clear, 
but this question is shown to be linked to another 
open question, namely the possibility of perform- 
ing all permutations in two passes on the shuffle- 
exchange network. Using the lowest known bound 
for the shuffle-exchange, we define two classes 

of three-pass networks and demonstrate the exist- 
ence of many members in each class. Finally, we 
show that some of the better known networks belong 
to the above classes. Beyond the results reported, 
questions and areas for additional research are 
identified. 


I. Introduction 

An important issue in the architecture of SIMD
arrays is the choice of a flexible connection network
for interprocessor (or processor-memory) communication.
The requirement of cost-effectiveness along with high
performance led to consideration of blocking multistage
interconnection networks. Such a network of size N
(N inputs x N outputs) consists of $\log_2 N$ stages,
each comprising N/2 elementary 2 input x 2 output,
two-state switches. Each stage is preceded or followed by
a fixed wiring pattern that connects it to the
adjacent stage or to the outside. Clearly, the
maximum number of "admissible" permutations
(realizable in a single pass) on such networks is
$\sqrt{N}^{N} = 2^{(N/2)\log_2 N}$, a small part of the
N! arbitrary permutations that exist.
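
The gap between the two counts can be illustrated with a minimal Python sketch. It assumes only what the paragraph states (one control bit per two-state switch, N/2 switches per stage, $\log_2 N$ stages); the function name `admissible_count` is a hypothetical helper, not something taken from the paper.

```python
from math import factorial

def admissible_count(n: int) -> int:
    """Upper bound on single-pass permutations: 2 ** (number of switches)."""
    stages = n.bit_length() - 1          # log2(n) for a power of two
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two, n >= 2")
    switches = (n // 2) * stages         # N/2 switches in each stage
    return 2 ** switches                 # equals sqrt(N) ** N

# Compare the bound with N! for a few small network sizes.
for n in (4, 8, 16):
    print(n, admissible_count(n), factorial(n))
```

For N = 8 the sketch reports 4096 admissible settings against 8! = 40320 permutations, and the ratio shrinks quickly as N grows.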


A number of networks have been suggested [1, 4,
5, 6], each characterized by its set of exactly
$\sqrt{N}^{N}$ admissible permutations. For the sake of
flexibility, it is desirable to realize arbitrary
permutations on blocking networks, even if this
requires multiple passes. Wu and Feng [1, 2]
suggested the Baseline network which is capable
of realizing arbitrary permutations in two passes,
the minimum possible. Recently, Parker [8] has
proven that the shuffle-exchange network can perform
arbitrary permutations in up to three passes,
though it is not known whether this is the minimum
upper bound. The same results had been shown
by the authors [3] in two ways: a constructive
proof by emulation of Beizer's [9] network, and
a shorter algebraic proof based on the properties
of the Baseline network [1, 2].


The wide range of networks proposed, and the
seemingly unique characteristics of some, suggest
the need for a general theory of blocking
interconnection networks. Siegel [6] and Wu and Feng
[1, 2] made significant contributions in this
direction. In view of the two-pass property of
the Baseline and the three-pass interim upper
bound of the shuffle exchange, we raise the
following questions:


(1) Are there other networks, different from 
either the Baseline or the shuffle exchange, 
which can perform arbitrary permutations in 
two or three passes? 


(2) Given a selected subset of up to $\sqrt{N}^{N}$
permutations, is there a multistage, blocking
network on which all the given permutations are
admissible, while any other permutations can
be realized in two or three passes?


In this paper we define several classes of two
and three pass networks and demonstrate the
existence of some of them. The second question
remains open for the time being, but the treatment
of question one may serve as a framework for
additional research.


Notation

The number of input (output) lines of an
interconnection network is denoted by $N = 2^m$, where m
is a positive integer. The input (and output)
lines are numbered sequentially from 0 to N-1.
This line number or address is denoted by a small
letter, $a, b, c \in \{0, 1, \ldots, N-1\}$, whose binary
expansion is given by $(a_{m-1}, a_{m-2}, \ldots, a_0)$ and whose
value is

$$a = \sum_{i=0}^{m-1} a_i 2^i$$

Permutations (interconnections) are designated by
small letters p, q, s, where p(a) = b is a
permutation that connects input line a to output
line b. Superscripts indicate repetitive
application of the permutation, the superscript -1
denoting the inverse permutation. Specific
permutations used in the paper are defined below.

(1) Identity: $e(a) = a$

(2) Bit reversal: $\rho(a) = (a_0, a_1, \ldots, a_{m-1})$

(3) Bit reversal excluding $a_0$:
$r(a) = (a_1, a_2, \ldots, a_{m-1}, a_0)$

(4) Perfect shuffle:
$\sigma(a) = (a_{m-2}, a_{m-3}, \ldots, a_1, a_0, a_{m-1})$
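
The four address permutations can be written down directly on bit lists. The following Python sketch is an illustration under one stated assumption: a product such as $\sigma\rho$ is read as ordinary function composition (apply $\rho$ first, then $\sigma$). The helper names (`bits`, `value`, `r_perm`) are hypothetical.

```python
def bits(a, m):
    """Return the binary expansion [a_{m-1}, ..., a_0] of address a."""
    return [(a >> i) & 1 for i in range(m - 1, -1, -1)]

def value(bs):
    """Inverse of bits(): most significant bit first."""
    v = 0
    for b in bs:
        v = (v << 1) | b
    return v

def identity(a, m):
    return a

def rho(a, m):
    """Bit reversal: (a_0, a_1, ..., a_{m-1})."""
    return value(bits(a, m)[::-1])

def r_perm(a, m):
    """Bit reversal excluding a_0: (a_1, ..., a_{m-1}, a_0)."""
    bs = bits(a, m)
    return value(bs[:-1][::-1] + bs[-1:])

def sigma(a, m):
    """Perfect shuffle: (a_{m-2}, ..., a_0, a_{m-1}), a left rotation."""
    bs = bits(a, m)
    return value(bs[1:] + bs[:1])

# Under the assumed convention, r equals "rho first, then sigma".
m = 3
print([r_perm(a, m) for a in range(1 << m)])
print([sigma(rho(a, m), m) for a in range(1 << m)])
```

Both printed lists coincide, and applying `rho` or `r_perm` twice returns every address to itself, so each is its own inverse.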


A network and its set of admissible permutations
will be denoted by a capital letter X, Y.
Operations on networks are defined below.

$XY = \{pq \mid p \in X,\ q \in Y\}$

$X^{-1} = \{p \mid p^{-1} \in X\}$

$sX = \{sp \mid p \in X\}$ and $Xs = \{ps \mid p \in X\}$

The group of all possible permutations on the
integers {0, 1, ..., N-1} will be designated by S.
Two specific networks of central interest in this
paper are:

(1) The Baseline or Reverse Exchange network, B,
is defined in [1]. A sketch of B for N=8
is given in Fig. 1.

(2) The Shuffle Exchange network, $\Omega$, in which
the wiring pattern preceding each stage is
described by $\sigma$.


II. Two-pass networks 


The Baseline interconnection network, B,
introduced by Wu and Feng [1, 2] exhibits some
interesting properties.

(a) $BB = S$, indicating that any arbitrary
permutation can be performed in two passes
through the network.

(b) $B = B^{-1}$, that is, $p \in B$ implies $p^{-1} \in B$.

(c) $e \notin B$: the identity permutation is not
admissible on B, therefore B cannot contain
any subgroup of S.

Of these properties the first one has particular 
significance from a practical point of view. Two 
questions are likely to arise with respect to the 
Baseline network and its characterisation by the 
above properties: 


(a) Is the Baseline network unique among block- 
ing multistage networks, or are there other 
different networks capable of performing any 
arbitrary permutation in two passes? 


(b) Does the ability to perform any permutation 
in two passes imply either or both of the 


remaining properties? 


In order to answer these questions, let us
postulate the existence of four classes of
"Baseline-like" networks, $\beta_0$ through $\beta_3$. X will be
used to represent a connection network, as well
as the set of admissible permutations on this
network.

$\beta_0 = \{X \mid XX = S;\ X = X^{-1};\ e \notin X\}$

$\beta_1 = \{X \mid XX = S;\ X \neq X^{-1};\ e \notin X\}$

$\beta_2 = \{X \mid XX = S;\ X = X^{-1};\ e \in X\}$

$\beta_3 = \{X \mid XX = S;\ X \neq X^{-1};\ e \in X\}$

It is easily seen that these subsets are distinct:
for every $i \neq j$; $i, j \in \{0, 1, 2, 3\}$, $\beta_i \cap \beta_j = \emptyset$
(the empty set).
$\beta_0$ networks

Let X represent a connection network
topologically equivalent to B. Applying the
definition of topological equivalence introduced by
Wu and Feng, this proposition implies that
$X = p_1 B p_2$ or alternatively $B = p_1^{-1} X p_2^{-1}$
(where $p_1, p_2 \in S$).

Let us concentrate on a particular subset of
the networks topologically equivalent to B; we
designate this subset as

$\hat{\beta}_0 = \{X \mid X = p_1 B p_1^{-1};\ p_1 \in S\}$

Lemma 1. $\hat{\beta}_0$ is a subset of $\beta_0$: $\hat{\beta}_0 \subseteq \beta_0$.

Proof. For every $X \in \hat{\beta}_0$:

(1) $X^{-1} = (p_1 B p_1^{-1})^{-1} = p_1 B^{-1} p_1^{-1} = p_1 B p_1^{-1} = X$

(2) $XX = (p_1 B p_1^{-1})(p_1 B p_1^{-1}) = p_1 BB p_1^{-1} = S$

(3) Suppose that $e \in X$; since $B = p_1^{-1} X p_1$,
this implies that $p_1^{-1} e p_1 = e \in B$, which
leads to a contradiction, therefore
$e \notin X$. Q.E.D.

Obviously $B \in \hat{\beta}_0$ ($p_1 = e$ in this case),
therefore $\hat{\beta}_0$ is not an empty set. Furthermore,
it is easily shown that $\hat{\beta}_0$ contains additional
elements. Let us assume the opposite of this
proposition. This would imply that for every
$p_1 \in S$, $p_1 B p_1^{-1} = B$. Recalling the definition
of a normal subgroup [7], the above means that B
is a normal subgroup of S and this leads to a
contradiction, since (due to the fact that $e \notin B$)
B cannot be a subgroup of S. Several examples of
$\hat{\beta}_0$ class networks are described in Fig. 1.


$\beta_1$ networks

In [3] we have shown that any arbitrary
permutation $p \in S$ can be decomposed into the form
$p_1 r p_2 = p$, where $p_1, p_2 \in \Omega$ (i.e. they are
admissible on the shuffle-exchange network) and the
permutation r (designated as R on in [3]) is
defined as

$r(a_{m-1}, \ldots, a_0) = (a_1, a_2, \ldots, a_{m-1}, a_0)$.

Obviously $r^{-1} = r$. Alternatively we can state:
$S = \Omega r \Omega$. Let E represent the network $\Omega r$.

Lemma 2: $E \in \beta_1$.

Proof:

(1) $EE = \Omega r \Omega r = (\Omega r \Omega) r = S$

(2) $E \neq E^{-1}$ is proven by an example
given in Fig. 2. We express E and
$E^{-1}$ in terms of B, using the
identities given by Wu and Feng in [1],
namely: $\Omega = B\rho$ and $\Omega^{-1} = \rho B$,
together with the identity $r = \sigma\rho$.
Hence $E = B\sigma^{-1}$ and $E^{-1} = \sigma B$.

(3) $e \notin E$, because if we assume the
opposite, then the solution of the
equation $e = qr$, $q \in \Omega$ yields $q = r$,
but it can be shown from Lawrie's
theorem 2 in [4] that $r \notin \Omega$. Q.E.D.

Similarly, we can show that $E^{-1} \in \beta_1$.

Let us now define set $\hat{\beta}_1$ as:

$\hat{\beta}_1 = \{X \mid X = p_1 E p_1^{-1} = p_1 B \sigma^{-1} p_1^{-1};\ p_1 \in S\}$.

Using the same method as was applied to $\hat{\beta}_0$ above,
it can be shown that $\hat{\beta}_1$ is a non-empty subset
of $\beta_1$. Several examples of $\hat{\beta}_1$-type networks
are described in Fig. 3.


$\beta_2$ and $\beta_3$ networks

So far, no networks of these types have been
identified, but neither has the possibility of
their existence been disproved. Interestingly,
this so far undecided question is related to
another unresolved problem: can any arbitrary
permutation be performed in two passes on the
shuffle-exchange network? The linkage between
these problems arises from the following
speculation.

If it is possible to perform all the
permutations in two passes on the shuffle-exchange
network, that is $\Omega\Omega = S$, then $\Omega \in \beta_3$, since

(1) $\Omega\Omega = S$

(2) $\Omega \neq \Omega^{-1}$

(3) $e \in \Omega$

Furthermore, all networks defined as $p_1 \Omega p_1^{-1}$
would belong to $\beta_3$:

(1) $(p_1 \Omega p_1^{-1})(p_1 \Omega p_1^{-1}) = p_1 \Omega\Omega p_1^{-1} = S$

(2) $(p_1 \Omega p_1^{-1})^{-1} = p_1 \Omega^{-1} p_1^{-1} \neq p_1 \Omega p_1^{-1}$

(3) $e \in p_1 \Omega p_1^{-1}$, because $e = p_1 e p_1^{-1}$.
III. Three-pass Networks


It has been shown [3, 8] that the shuffle-exchange
network can perform any arbitrary permutation
in three passes. Following the generalization
of the two-pass property of the Baseline
network, the next logical step is to search for
a new class of networks capable of performing
any permutation in three passes. Let us
designate this class as $\tau$:

$\tau = \{X \mid XXX = S\}$

Parker [8] showed that $S = \Omega\rho\Omega$, and
$\rho = w_0 w_1$ so that $w_1, w_0 \in \Omega$ and $w_1 \Omega = \Omega$.
In [3] we had given a different factorization of
$\rho$: $\rho = r_4 r_3$, so that $r_3, r_4 \in \Omega$ and $\Omega r_4 = \Omega$.
Likewise, we also showed in [3] that $S = \Omega r \Omega$,
where $r = r_2 r_1$ such that $r_1, r_2 \in \Omega$ and
$\Omega r_2 = \Omega$. The permutations $r_1, r_2, r_3, r_4$ are
defined below using a set of functions of the
form $b = (b_{m-1}, b_{m-2}, \ldots, b_i, \ldots, b_0)$,
where $i = 0, 1, \ldots, m-1$. Also, $k = (m-1)$ div 2
and $\ell = m$ div 2, where div represents integer
division.

$$r_1:\quad b_i = \begin{cases} a_i \oplus a_{m-i} & \text{if } m \text{ even and } i > k+1, \text{ or } m \text{ odd and } i > k \\ a_i \oplus b_{m-i} & \text{if } 0 < i \le k \\ a_i & \text{if } i = 0, \text{ or } i = k+1 \text{ and } m \text{ is even} \end{cases}$$

$$r_2:\quad b_i = \begin{cases} a_i \oplus a_{m-i} & \text{if } m \text{ even and } i > k+1, \text{ or } m \text{ odd and } i > k \\ a_i & \text{otherwise} \end{cases}$$

$$r_3:\quad b_i = \begin{cases} a_i \oplus a_{m-i-1} & \text{if } m \text{ is even and } i \ge \ell, \text{ or } m \text{ is odd and } i \ge \ell+1 \\ a_i \oplus b_{m-i-1} & \text{if } i < \ell \\ a_i & \text{if } m \text{ is odd and } i = \ell \end{cases}$$

$$r_4:\quad b_i = \begin{cases} a_i \oplus a_{m-i-1} & \text{if } m \text{ is even and } i \ge \ell, \text{ or } m \text{ is odd and } i \ge \ell+1 \\ a_i & \text{otherwise.} \end{cases}$$

The subset of permutations
$\overleftarrow{\Omega} = \{p \mid \Omega p = \Omega;\ p \in \Omega\}$,
including permutations $r_2$ and $r_4$, is worth some
attention.

Lemma 3: $\overleftarrow{\Omega}$ is a subgroup of S.

Proof: (1) $\overleftarrow{\Omega} \subseteq S$, therefore it is finite.

(2) Let $p_1, p_2 \in \overleftarrow{\Omega}$. Then
$\Omega(p_1 p_2) = (\Omega p_1) p_2 = \Omega p_2 = \Omega$,
hence $(p_1 p_2) \in \overleftarrow{\Omega}$; in other words,
$\overleftarrow{\Omega}$ is closed under multiplication.
Q.E.D.

(It is possible to show the existence of a similar
subgroup $\overrightarrow{\Omega} = \{p \mid p\Omega = \Omega;\ p \in \Omega\}$; for
example, Parker's $w_1$ permutation is an element
of $\overrightarrow{\Omega}$. Furthermore,
$\overleftrightarrow{\Omega} = \overleftarrow{\Omega} \cap \overrightarrow{\Omega}$ is also a
non-empty, non-trivial subgroup.)

The property associated with permutations in
$\overleftarrow{\Omega}$ we call "right invariance", while $\overleftarrow{\Omega}$ is called
the "right invariant subgroup". The scope of
right invariance may be extended beyond the
shuffle-exchange network.

Corollary 1. Any connection network X fulfilling
the following two conditions has an associated
right invariant subgroup, $\overleftarrow{X}$: (1) $e \in X$;
(2) there exists at least one permutation $p_1$ so
that $p_1 \in X$ and $X p_1 = X$.

Let $\hat{\tau}_0$ represent a set of networks
topologically equivalent to the Baseline network B,
defined by:

$\hat{\tau}_0 = \{X \mid X = p_1 B p_2;\ p_1 p_2 = p_2 p_1 = \rho\}$.

$\hat{\tau}_0$ networks

Let $X \in \hat{\tau}_0$, then $X = p_1 B p_2$ where $p_1 p_2 =
p_2 p_1 = \rho$. We can express B in terms of X:
$B = p_1^{-1} X p_2^{-1}$. Since $BB = S$, we obtain
$X \rho X = S$. (A similar expression, $S = \Omega\rho\Omega$,
led [3, 8] to the conclusion that the
shuffle-exchange is a three-pass network.)


Lemma 4. All networks belonging to $\hat{\tau}_0$ are
three-pass networks, that is $\hat{\tau}_0 \subseteq \tau$.

Proof. We shall prove that $\rho$ can be decomposed
into a product of two permutations in the form
$\rho = r_4' r_3'$ so that $r_3', r_4' \in X$ and $X r_4' = X$.

(1) Using the relationship between the Baseline
and shuffle-exchange networks [1] we can modify
the definition of X to $X = p_1 \Omega \rho p_2$. Recalling
the decomposition of $\rho$ for the shuffle-exchange
network ($\rho = r_4 r_3$ where $r_3, r_4 \in \Omega$;
$\Omega r_4 = \Omega$) we define: $r_4' = p_1 r_4 \rho p_2$ and
$r_3' = p_1 r_3 \rho p_2$. Obviously $r_3', r_4' \in X$.

(2) $r_4' r_3' = (p_1 r_4 \rho p_2)(p_1 r_3 \rho p_2)
= p_1 (r_4 (\rho (p_2 p_1)) r_3) \rho p_2 = \rho$

(3) Let q' be an arbitrary permutation so that
$q' \in X$. For every such q' there exists a
permutation $q \in \Omega$ such that $q' = p_1 q \rho p_2$. Then

$q' r_4' = (p_1 q \rho p_2)(p_1 r_4 \rho p_2) = p_1 (q r_4) \rho p_2$.

Since $q r_4 \in \Omega$ for any q (because $r_4 \in \overleftarrow{\Omega}$),
$q' r_4' \in X$ for every $q' \in X$.

Q.E.D.


$\hat{\tau}_1$ networks

The network $E = \Omega r \in \beta_1$ is a two-pass network.
We now define a new set $\hat{\tau}_1$ containing networks
topologically equivalent to E (as well as to
B, since $E = B\rho r$):

$\hat{\tau}_1 = \{X \mid X = p_1 E p_2;\ p_1 p_2 = p_2 p_1 = r\}$

From this definition, $E = p_1^{-1} X p_2^{-1}$. Since
$EE = S$, we conclude that $X r X = S$.

Lemma 5. All the networks belonging to $\hat{\tau}_1$ are
three-pass networks: $\hat{\tau}_1 \subseteq \tau$.

Proof. As indicated earlier for the
shuffle-exchange, r can be decomposed into a product
of two permutations in the form $r = r_2 r_1$ where
$r_1, r_2 \in \Omega$ and $\Omega r_2 = \Omega$.
Following the proof of Lemma 4, it can be shown
that the permutations $r_2' = p_1 r_2 r p_2$ and
$r_1' = p_1 r_1 r p_2$ have the properties $r_2' r_1' = r$;
$r_1', r_2' \in X$; $X r_2' = X$. This proves the lemma.

Q.E.D.


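It is easily shown that the shuffle-exchange
network belongs to both $\hat{\tau}_0$ and $\hat{\tau}_1$:

$\Omega = e(\Omega\rho)\rho = eB\rho$, therefore $\Omega \in \hat{\tau}_0$

$\Omega = e(\Omega r)r = eEr$, hence $\Omega \in \hat{\tau}_1$.

Hence $\hat{\tau}_0 \cap \hat{\tau}_1 \neq \emptyset$; whether $\hat{\tau}_0 = \hat{\tau}_1$ seems to
be a more difficult question. Other networks
belonging to $\hat{\tau}_0$ may be found by the following
lemma.

Lemma 6. $X \in \hat{\tau}_0$ implies $X^{-1} \in \hat{\tau}_0$.

Proof. $X \in \hat{\tau}_0$ implies $X = p_1 B p_2$ where
$p_1 p_2 = p_2 p_1 = \rho$. Hence $X^{-1} = p_2^{-1} B p_1^{-1}$. But
$p_2^{-1} p_1^{-1} = p_1^{-1} p_2^{-1} = \rho^{-1} = \rho$, therefore
$X^{-1} \in \hat{\tau}_0$.

Q.E.D.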
Some examples of three-pass networks are given in
Fig. 4. The equations $p_1 p_2 = p_2 p_1 = \rho$ and
$p_1 p_2 = p_2 p_1 = r$ are worth some attention. They
can be transformed to a more useful form,
$p_2 \rho p_2^{-1} = \rho$ and $p_2 r p_2^{-1} = r$.
Let $G_\rho$ and $G_r$ represent the set of
solutions to these equations respectively:

$G_\rho = \{p_2 \mid p_2 \rho p_2^{-1} = \rho;\ p_2 \in S\}$

$G_r = \{p_2 \mid p_2 r p_2^{-1} = r;\ p_2 \in S\}$

Lemma 7. $G_\rho$ and $G_r$ are subgroups in S.

Proof. (1) $G_\rho$ and $G_r$ are finite, since
$G_\rho \subseteq S$ and $G_r \subseteq S$.

(2) Let $q_1, q_2 \in G_\rho$. Then $q_1 \rho q_1^{-1} = \rho$
and $q_2 \rho q_2^{-1} = \rho$. Hence
$(q_1 q_2) \rho (q_1 q_2)^{-1} = q_1 (q_2 \rho q_2^{-1}) q_1^{-1} = q_1 \rho q_1^{-1} = \rho$.

By similar treatment of $G_r$ we conclude that
both $G_\rho$ and $G_r$ are closed with respect to
multiplication. Q.E.D.

We can establish a lower bound for the number
of elements in these subgroups as follows:
The permutation $\rho$ acts as the identity
permutation on those elements whose binary address is
symmetric (e.g. 01011010 -> 01011010). For a
network of size $N = 2^m$ there are $k_\rho$ elements
with symmetric addresses, where

$$k_\rho = \begin{cases} 2^{m/2} & \text{if } m \text{ is even} \\ 2^{(m+1)/2} & \text{if } m \text{ is odd} \end{cases}$$

Therefore there are $(k_\rho)!$ permutations which
act on the elements with symmetric addresses only,
leaving the other elements undisturbed. It should
be obvious that for any $p_2$ belonging to these
$(k_\rho)!$ permutations, $p_2 \rho = \rho p_2$; hence
$G_\rho$ contains at least $(k_\rho)!$ elements.

Similarly, for $p_2 \in G_r$, counting the addresses
left unchanged by r gives

$$k_r = \begin{cases} 2^{(m+2)/2} & \text{if } m \text{ is even} \\ 2^{(m+1)/2} & \text{if } m \text{ is odd} \end{cases}$$

while the number of possible permutations is
$(k_r)!$.
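
The counts of symmetric addresses are easy to confirm by enumeration. The Python sketch below is an illustration only: the closed form used for $k_r$ (two choices for $a_0$ times the number of palindromic settings of the remaining m-1 bits) is derived here from the definition of r and should be treated as an assumption, and all function names are hypothetical.

```python
def rho(a, m):
    """Bit reversal of an m-bit address."""
    return sum(((a >> i) & 1) << (m - 1 - i) for i in range(m))

def r(a, m):
    """Reverse bits m-1..1, keep bit 0 in place."""
    upper = a >> 1
    rev = sum(((upper >> i) & 1) << (m - 2 - i) for i in range(m - 1))
    return (rev << 1) | (a & 1)

def fixed_points(perm, m):
    """Addresses that the permutation maps to themselves."""
    return [a for a in range(1 << m) if perm(a, m) == a]

def k_rho(m):
    return 2 ** (m // 2) if m % 2 == 0 else 2 ** ((m + 1) // 2)

def k_r(m):
    return 2 ** ((m + 2) // 2) if m % 2 == 0 else 2 ** ((m + 1) // 2)

def swap_two(fixed):
    """A permutation that exchanges two symmetric addresses only."""
    x, y = fixed[0], fixed[1]
    return lambda a: y if a == x else x if a == y else a

m = 5
sym = fixed_points(rho, m)
p2 = swap_two(sym)
print(len(sym), k_rho(m))
print(all(p2(rho(a, m)) == rho(p2(a), m) for a in range(1 << m)))
```

The last line checks the commutation property used in the text: a permutation that moves only symmetric addresses commutes with $\rho$.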


Another interesting question is how different
are two unequal $\hat{\tau}_0$ networks? More precisely,
let $X, Y \in \hat{\tau}_0$ and $X \neq Y$; what can we say about
D, the number of permutations in the set
$\{p \mid p \in X \text{ and } p \notin Y\}$?

Lemma 8. For two unequal networks of size $N = 2^m$,
D is not less than $\sqrt{N}^{N} / N^2$.

Proof. Since $X \neq Y$, there exists at least one
connection of two input lines to two output lines
which cannot be realized on Y, but can be
realized on X. Naturally Y cannot realize any
permutation which includes this connection. On
the other hand, implementation of this specific
connection of pairs on X requires the setting
of no more than two exchange boxes (out of N/2)
in each of the m stages. The maximum number of
exchange boxes involved is therefore 2m, leaving
the other $\frac{mN}{2} - 2m$ free. Therefore X can
realize at least $2^{(mN/2 - 2m)} = \sqrt{N}^{N} / N^2$ permutations
each of which contains the connection-pair which
cannot be realized on Y. Q.E.D.


Before we conclude our remarks on three-pass
networks, let us return to right-invariant
permutations. So far, we only made use of selected
right-invariant permutations in proving the
existence of classes $\hat{\tau}_0$ and $\hat{\tau}_1$. The practical
significance of right- and left-invariance lies
in the fact that the permutations having this
property can be used to characterize classes of
permutations performable in two passes. If $\overleftarrow{X}$ is
the subset of right-invariant permutations on a
network X, then all the permutations in the set
$X\overleftarrow{X}X \cup X\overrightarrow{X}X$ can be performed in two passes. For
example:

$B = \Omega\rho = \Omega r_4 r_3 = \Omega r_3$ (because $r_4 \in \overleftarrow{\Omega}$),

so the shuffle-exchange network can perform in two
passes all the permutations admissible on the
Baseline. In addition right- and left-invariance
may be used to recognize or prove admissibility
of a given permutation by its possible
decomposition using the identity $X = X\overleftarrow{X}$. We proceed
with some additional lemmas concerning
right-invariance. Identical statements hold for
left-invariance.


Lemma 9. The set of permutations admissible on a
network belonging to $\hat{\tau}_0$ or $\hat{\tau}_1$ contains a
right-invariant permutation subgroup.

Proof. The existence of a single right-invariant
permutation for each network type has already
been demonstrated: $r_4'$ in $\hat{\tau}_0$ and $r_2'$ in $\hat{\tau}_1$.
In order to satisfy the conditions stated in
Corollary 1, we have to show that all $\hat{\tau}_0$ and $\hat{\tau}_1$
type networks contain the identity permutation.
Let $X \in \hat{\tau}_0$, hence $X = p_1 \Omega \rho p_2$, where
$p_1 p_2 = p_2 p_1 = \rho$. Assume $q \in S$ such that
$p_1 q \rho p_2 = e$. Then $q = p_1^{-1} p_2^{-1} \rho = e \in \Omega$.
Similarly for $X \in \hat{\tau}_1$. Q.E.D.


Lemma 10. The right-invariant permutation
subgroups of all networks in $\hat{\tau}_0$ are isomorphic to
each other.

Proof. Let $X, Y \in \hat{\tau}_0$, where $X = p_1 \Omega \rho p_2$;
$p_1 p_2 = p_2 p_1 = \rho$ and $Y = q_1 \Omega \rho q_2$;
$q_1 q_2 = q_2 q_1 = \rho$.

Since we are dealing with networks of the same
size, there is a one-to-one mapping between X
and Y: $Y = F(X) = q_1 p_1^{-1} X p_2^{-1} q_2$.
Let $s, s_1, s_2 \in \overleftarrow{X}$. Let $s', s_1', s_2' \in Y$, defined as
$s' = F(s)$, $s_1' = F(s_1)$, $s_2' = F(s_2)$. To prove
that $\overleftarrow{X} \cong \overleftarrow{Y}$ ($\overleftarrow{X}$ is isomorphic to $\overleftarrow{Y}$) we must
show that $F(s_1), F(s_2) \in \overleftarrow{Y}$ for any $s_1, s_2 \in \overleftarrow{X}$
and $F(s_1)F(s_2) = F(s_1 s_2)$.

(1) $s' s_1' = q_1 p_1^{-1} s p_2^{-1} q_2 q_1 p_1^{-1} s_1 p_2^{-1} q_2
= q_1 p_1^{-1} (s s_1) p_2^{-1} q_2$,
because $p_2^{-1} q_2 q_1 p_1^{-1} = p_2^{-1} \rho p_1^{-1} = e$.

The above is true for any $s' \in Y$, therefore,
since $X s_1 = X$ we conclude that $Y s_1' = Y$ or
$s_1' \in \overleftarrow{Y}$.

(2) Similarly, we can show that $s_2' \in \overleftarrow{Y}$.

(3) $F(s_1)F(s_2) = q_1 p_1^{-1} s_1 p_2^{-1} q_2 q_1 p_1^{-1} s_2 p_2^{-1} q_2
= q_1 p_1^{-1} (s_1 s_2) p_2^{-1} q_2 = F(s_1 s_2)$. Q.E.D.

By the same method we can prove:

Lemma 11. The right-invariant permutation
subgroups of networks in $\hat{\tau}_1$ are isomorphic to
each other.

Since $\Omega \in \hat{\tau}_0$ and $\Omega \in \hat{\tau}_1$, it follows directly
from the last two lemmas that there is the same
number of right-invariant permutations in any
network belonging to $\hat{\tau}_0$ or $\hat{\tau}_1$.


IV. Summary and suggestions for further research. 


By generalizing the Baseline network, four di- 
stinct subclasses of two-pass interconnection 
networks were defined. The existence of many 
different networks in two of these subclasses was 
proven and exemplified. It was also shown that 
many different networks exist capable of perform- 
ing arbitrary permutations in no more than three 
passes, thereby generalizing the property that had 
been specifically proven [3, 8] for the shuffle- 
exchange network. Table 1 below shows that some 
of the most widely known multistage blocking net- 


works are three pass networks. 


We should emphasise several points about the 
class of three-pass networks. When performing
arbitrary permutations in three passes, the middle 
(second) pass realizes a constant permutation spe- 
cific to the network used but independent of the 
overall permutation being implemented. Hence the 
number of variable (permutation dependent) control 
bits is equal to that of two-pass networks. Fur- 


thermore, a large set of permutations can be 


implemented in less than three passes. Finally, 
it should be remembered that any of the three-pass 
networks (including the shuffle-exchange) may turn 
out to be two-pass, since it was defined as capa- 
ble of performing an arbitrary permutation in no 


more than three passes. 


Table 1. Classification of some well-known
networks under individual switch control.

Network | Relation to B or E | Subclass | Number of passes
$\Omega$, Shuffle-exchange (Omega) | $eB\rho$, $eEr$ | $\hat{\tau}_0$, $\hat{\tau}_1$ |
Inverse Indirect Binary n-Cube | | |
$F^{-1}$, Inverse Flip network | | |
$\Omega^{-1}$, Inverse SE (Inverse Omega) | | |
Indirect Binary n-Cube | $\rho Be$ | $\hat{\tau}_0$ |
Flip network | $\rho Be$ | $\hat{\tau}_0$ |


The methods used in this paper and the lemmas 
proven lead to some additional interrelated ques- 
tions: 

(1) How many different networks are there in each
of the subclasses $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\tau}_0$ and $\hat{\tau}_1$?

(2) Are there two or three-pass networks not
topologically equivalent to the Baseline network?

(3) Is every permutation admissible on some two or
three-pass network?

(4) Is it possible to synthesise a two or three-pass
network of size N which admits an arbitrary
subset of S not exceeding $\sqrt{N}^{N}$
permutations?

(5) Are there iterative (single-stage recirculating)
networks other than $\Omega$ and $\Omega^{-1}$ capable
of performing arbitrary permutations in two
or three passes?
Summary


Many problems in the domain of Artificial Intelli- 
gence (A.I.) require great computation power and 
are suitable for parallel processing [4]. This
paper presents a modification of Hewitt's Actor 
System [¢-*], oriented to these problems, in parti- 
cular processing of data from the real world, such 
as continuous speech or visual images. In these ca- 
ses various sources of errors affect the input data 
and also the a-priori knowledge is ambiguous and 
uncertain. 

The model described here takes into account these
peculiarities, devoting particular attention to
the flow of messages and the scheduling philosophy.
In the last decade many efforts have been made
toward the application of parallel processing to this
kind of problem; the use of traditional programming
techniques, which do not provide distributed
and non-deterministic control structures, leads to
complex implementations, unsuitable for formal
description and full of ad hoc solutions.

The approach we propose is based on the analysis 

of the specific characteristics of the said class 

of problems, in particular: 

- Uncertainty and great number of data to be
processed.

- Intrinsic concurrent nature of the decision
algorithms.

- Complexity of the control structure and therefore
need of introducing non-deterministic constructs.

- Type of computations to be executed, simple but
very frequent.

- Natural structuring of the data base and of the
information utilized by the algorithms, which
are suitable to be distributed.

The last point, i.e. data base distribution, is one of
the most relevant factors in order to obtain a high
degree of parallelism [4,5]. The interest in parallel
processing [4,5] is then quite obvious: many concurrent
tasks, performed by asynchronous processors,
will hopefully speed up the search for a solution.
In fact many alternatives may be followed at the
same time and results evaluated and compared. A
heuristic search strategy selects a subset of possible
alternatives, by taking into account information
of various kinds about the problem [4]. This
information can be described by means of Knowledge
Sources (KSs), which will help in solving the problem.
To this aim, the KSs are hierarchically linked
together in such a way that the search for a
solution may be performed at different levels of
analysis. Each KS can work at a given level, by
utilizing the results obtained by those working at
lower levels. The KSs cooperate in the emission of
hypotheses about solutions, according to the
paradigm "Hypothesize and Test", i.e. each KS can emit
a partial hypothesis, which will then be verified
by the KS itself or by others [1]. This hypothesis
emission can be activated either bottom-up (data
driven) or top-down (model driven); bottom-up
stimulation occurs when a nucleus of a hypothesis is
drawn directly from the data; top-down invocation
occurs when a KS calls another one at a lower
level to verify a part of the hypothesis; each KS can
also predict some part of the input data, when
comprehension was not satisfactory. Therefore in such
a system we have a continuous flow of information
in the two directions. Moreover the relations
between the elements contained in a KS (i.e. 'concepts'
in a semantic network) are expressed by means of
AND/OR graphs. This KS organisation and hypothesis
formation process can be favorably implemented by
means of non deterministic constructs &$]. In an OR 
node of an AND/OR graph, for example, progress can
be evaluated on the basis of the first satisfactory
verification without waiting for all the other
components.

Various models of computation, based on the idea of
communicating processes, have been recently propo- 
sed [69,0]; in particular Hewitt's Actor System 

has been developed as a general tool for modelling
A.I. control strategies. The model we propose derives
from the Actor System and has been designed taking
into account the particular class of problems
described. Fundamental objects of the Actor System
are the Actors, potentially active pieces of knowledge,
communicating among themselves by means of
messages. In Hewitt's model, messages are also actors,
but here we will refer to actor-messages simply
as messages. Messages contain data structures
and possibly descriptions of other actors to be created.
The receipt of a message by an actor is an
Event; this activates the actor itself, which in
turn processes the message, updating its local knowledge;
moreover it may send new messages and possibly
create new actors. An actor activation must
always terminate; any other message arriving during
this phase must wait for the end of the current
activation. Interference between messages is resolved
by a fair arbiter.
As the activation of an actor can depend upon the
random sequence of its events, the Actor System
includes a potential non-determinism (within the
actor). This non-determinism can be favorably exploited
by the control strategies previously described.
Furthermore, the Actor System shows a dynamic and
flexible structure, in that actors can be created
and then removed from the system when no longer
needed. This last feature is fundamental for us,
because it is impossible to have all the possible
instantiations of the KSs a priori. On the other hand,
other fundamental features for our applications are
not explicitly included in Hewitt's model. First
of all, in the Actor System the receipt of the
messages by an actor is controlled by a fair arbiter,
in order to avoid starvation. In this way it is
not possible to control the message flow outside the
actors and any scheduling strategy must then be
included in the actor itself. On the contrary, in our
case processing of the most reliable hypotheses must
always be preferred to the other ones, when competing
with others for the same resource (e.g. the
activation of the same actor). In fact the starvation
of a bad hypothesis is not relevant. Thus in the
said kind of applications, the direct implementation
according to the Actor System leaves the job of
designing all the scheduling strategy supports to the
user, complicating the programming task.

Another specific feature of our problem is the
presence of two flows of information, i.e. bottom-up
and top-down. As the flows may not have the same
weight in different situations, it is better to
have two autonomous control policies. The fundamental
difference between our model and Hewitt's is the
policy of message reception; in particular:

- A set $\mathcal{C}$ of different classes of messages is
defined.

- When an actor sends a message (to another), it
assigns both a class identifier $c \in \mathcal{C}$ and a
priority p.

- The actors receive the messages served by an
arbiter which then orders and dispatches them
according to a user definable function $f(c, p_M(c), Q)$,
where c is the class identifier, $p_M(c)$ is the
maximum priority of the waiting messages of class c,
and Q is a parameter settable by the actor. In this
way it is possible to specify f as a function
depending on Q in order to dynamically assign a
preference to the messages of a particular class.

If a unique class of messages is defined and the 

same priority assigned to all the messages, the ori- 

ginal Actor System fair scheduling is obtained. 
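
The reception policy can be illustrated with a small Python sketch. It is not the authors' implementation: the class name `Arbiter`, the convention that the class with the highest value of f is served first, and the example function are all assumptions made for illustration. Within a class, the waiting message of maximum priority is served, with first-come-first-served order among equal priorities.

```python
import heapq
from itertools import count

class Arbiter:
    """Dispatches waiting messages according to a user-supplied f(c, p_max, q)."""

    def __init__(self, f, q=None):
        self.f = f                  # user-definable scoring function (assumed)
        self.q = q                  # parameter Q, settable by the actor
        self._queues = {}           # class id -> heap of (-priority, seq, msg)
        self._seq = count()         # arrival order, used to break ties fairly

    def send(self, cls, priority, message):
        heap = self._queues.setdefault(cls, [])
        heapq.heappush(heap, (-priority, next(self._seq), message))

    def dispatch(self):
        """Return the next message to deliver, or None if nothing waits."""
        waiting = {c: h for c, h in self._queues.items() if h}
        if not waiting:
            return None
        # p_M(c) is the maximum priority among the waiting messages of class c.
        best = max(waiting, key=lambda c: self.f(c, -waiting[c][0][0], self.q))
        return heapq.heappop(waiting[best])[2]

# Hypothetical policy: prefer the class named by Q, then higher priority.
prefer = lambda c, p_max, q: p_max + (10 if c == q else 0)
arb = Arbiter(prefer, q="stimuli")
arb.send("messages", 5, "m1")
arb.send("stimuli", 1, "s1")
arb.send("messages", 7, "m2")
print([arb.dispatch() for _ in range(4)])
```

With a single class and equal priorities the sketch reduces to first-come-first-served delivery, which corresponds to the fair scheduling case mentioned above. Changing `arb.q` at run time shifts the preference between the bottom-up and top-down flows.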

Furthermore, our model can be described in terms of
Hewitt's model. In fact, the actor so defined can
be considered as a compound of two actors, $A_1$ and $A_2$,
where $A_1$ is a Guardian [8], which fairly receives the
messages and then dispatches them to $A_2$ according
to the described function f.
We will now describe the application of the
computation model to the control of the semantic
knowledge source for a Speech Understanding System [12].

The semantic networkl"#Jconsists of a graph, whose 
nodes represent concepts and whose arcs represent 
compatibility conditions among concepts. The graph 
is partitioned into subgraphs (Islands), which are, 
in turn, sets of correlated concepts with direct 
access to the input data. Fig. 1 shows the levelled 
hierarchy of the nodes in the graph: at each level 
the relationships among the nodes at the lower level 
are expressed by means of AND/OR relations [ 3 a 
Fig. 1 also shows the implementation scheme of the
KS in terms of actors. Each α node (a memory actor
which knows the AND/OR relations among a set of
nodes at a lower level) is associated with a
Controller actor C_α, which contains α in its
acquaintances. (The acquaintances of an actor A are
the set of all other actors which are known by A.)
The motivation for introducing C_α is the following:
the same node may be called during a top-down
process (by means of a message belonging to the class
Messages) or during a bottom-up process (by means
of a message belonging to the class Stimuli). An
actor A can communicate with actor B only if it
creates B or is acquainted with B. Under this rule
alone, the two strategies mentioned above would
proceed independently, without intercommunication,
and would duplicate all actors called by both.
Instead, to realize an effective strategy of
cooperation and exchange of results, for each α node
we introduce the C_α controller actor, which is
globally known and predefined at the time of system
initialization. C_α receives the calls to α and
coordinates the creation of the new actors needed
for developing the two strategies. We notice that,
because of the great number of requests which, in
general, C_α receives, the controller may make wide
use of the facilities introduced in the model:
differentiation between the bottom-up and top-down
processes, flexibility of the message receipt
scheduling, and tolerance of the starvation of some
requests. In fact this is how the controller hinders
the progress of bad hypotheses, thus limiting the
number of computations to be executed.

Finally, an Initializer actor manages the access to
the input data.

The actors described in Fig. 1 represent the a
priori knowledge of the system. But, when actual data
are processed, other actors will be dynamically
created (Fig. 2). In fact, when a controller C_α
receives a request for verifying a hypothesis, it
develops, by creating a Producer actor P_α, the
AND/OR graph contained in its acquaintances and
produces the necessary AND/OR actors. The latter, in
turn, will send (via IF actors) requests to the
controllers of lower nodes and so on. The answers
will return to C_α via the same path. When a C_α
controller is stimulated again, it will not repeat
the verification process, but will send the
previously found results.
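
The reuse of results by a controller can be illustrated with a small Python sketch. It is a deliberate simplification and not the system described here: message passing, Producer actors and IF actors are replaced by direct synchronous calls, and the names `Controller`, `verify` and `verifications` are hypothetical. A leaf node is assumed to hold when its name appears in the input data.

```python
class Controller:
    """Hypothetical stand-in for a controller actor guarding one node."""

    def __init__(self, name, op=None, children=()):
        self.name = name
        self.op = op                  # "AND", "OR", or None for a leaf node
        self.children = list(children)
        self._result = None           # None means "not verified yet"
        self.verifications = 0        # how many times the work was really done

    def verify(self, data):
        # Only the first request triggers verification; later requests
        # get the stored answer (the cache assumes the input data is fixed).
        if self._result is None:
            self.verifications += 1
            if self.op is None:
                self._result = self.name in data
            else:
                answers = [child.verify(data) for child in self.children]
                self._result = (all(answers) if self.op == "AND"
                                else any(answers))
        return self._result


# One lower node shared by two higher nodes, as when a top-down and a
# bottom-up process both reach the same concept.
shared = Controller("verb")
noun = Controller("noun")
phrase = Controller("phrase", "AND", [shared, noun])
query = Controller("query", "OR", [shared])

data = {"verb", "noun"}
phrase_ok = phrase.verify(data)
query_ok = query.verify(data)
```

Because both higher nodes go through the same controller object, the shared node is verified once and the second caller receives the stored answer, which is the behaviour the paragraph above attributes to a controller that is stimulated again.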
The experimental system based on this model is now 
being implemented on a DEC-10 using SIMULA 67,